On Thu, Jun 23, 2011 at 7:31 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
> On Thu, Jun 23, 2011 at 6:21 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
>
>> On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>>> On Thu, Jun 23, 2011 at 2:44 PM, Robert Kern <robert.k...@gmail.com>
>>> wrote:
>>> > On Thu, Jun 23, 2011 at 15:53, Mark Wiebe <mwwi...@gmail.com> wrote:
>>> >> Enthought has asked me to look into the "missing data" problem and
>>> >> how NumPy could treat it better. I've considered the different ideas
>>> >> of adding dtype variants with a special signal value and masked
>>> >> arrays, and concluded that adding masks to the core ndarray appears
>>> >> to be the best way to deal with the problem in general.
>>> >> I've written a NEP that proposes a particular design, viewable here:
>>> >>
>>> >> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>>> >> There are some questions at the bottom of the NEP which definitely
>>> >> need discussion to find the best design choices. Please read, and
>>> >> let me know of all the errors and gaps you find in the document.
>>> >
>>> > One thing that could use more explanation is how your proposal
>>> > improves on the status quo, i.e. numpy.ma. As far as I can see, you
>>> > are mostly just shuffling around the functionality that already
>>> > exists. There has been a continual desire for something like R's NA
>>> > values by people who are very familiar with both R and numpy's masked
>>> > arrays. Both have their uses, and as Nathaniel points out, R's
>>> > approach seems to be very well-liked by a lot of users. In essence,
>>> > *that's* the "missing data problem" that you were charged with: making
>>> > happy the users who are currently dissatisfied with masked arrays. It
>>> > doesn't seem to me that moving the functionality from numpy.ma to
>>> > numpy.ndarray resolves any of their issues.
>>>
>>> Speaking as a user who's avoided numpy.ma, it wasn't actually because
>>> of the behavior I pointed out (I never got far enough to notice it),
>>> but because I got the distinct impression that it was a "second-class
>>> citizen" in numpy-land. I don't know if that's true. But I wasn't sure
>>> how solidly things like interactions between numpy and masked arrays
>>> worked, or how , and it seemed like it had more niche uses. So it just
>>> seemed like more hassle than it was worth for my purposes. Moving it
>>> into the core and making it really solid *would* address these
>>> issues...
>>>
>>
>> These are definitely things I'm trying to address.
>>
>>> It does have to be solid, though. It occurs to me on further thought
>>> that one major advantage of having first-class "NA" values is that it
>>> preserves the standard looping idioms:
>>>
>>>   for i in xrange(len(x)):
>>>       x[i] = np.log(x[i])
>>>
>>> According to the current proposal, this will blow up, but np.log(x)
>>> will work. That seems suboptimal to me.
>>>
>>
>> This boils down to the choice between None and a zero-dimensional array as
>> the return value of 'x[i]'. This, and the desire that 'x[i] == x[i]' should
>> be False if it's a masked value have convinced me that a zero-dimensional
>> array is the way to go, and your example will work with this choice.
>>
>>> I do find the argument that we want a general solution compelling. I
>>> suppose we could have a magic "NA" value in Python-land which
>>> magically triggers fiddling with the mask when assigned to numpy
>>> arrays.
>>>
>>> It should also be possible to accomplish a general solution at the
>>> dtype level. We could have a 'dtype factory' used like:
>>>   np.zeros(10, dtype=np.maybe(float))
>>> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
>>> + 1, where the extra byte is used to store missingness information.
>>> (There might be some annoying alignment issues to deal with.) Then for
>>> each ufunc we define a handler for the maybe dtype (or add a
>>> special-case to the ufunc dispatch machinery) that checks the
>>> missingness value and then dispatches to the ordinary ufunc handler
>>> for the wrapped dtype.
>>>
>>
>> The 'dtype factory' idea builds on the way I've structured datetime as a
>> parameterized type, but the thing that kills it for me is the alignment
>> problems of 'x.itemsize + 1'. Having the mask in a separate memory block is
>> a lot better than having to store 16 bytes for an 8-byte int to preserve the
>> alignment.
>>
>
> Yes, but that assumes it is appended to the existing types in the dtype
> individually instead of the dtype as a whole. The dtype with mask could just
> indicate a shadow array, an alpha channel if you will, that is essentially
> what you are already doing but just provide a different place to track it.
>

This would seem to change the definition of a dtype - currently it
represents a contiguous block of memory. It doesn't need to use all of that
memory, but the dtype conceptually owns it. I kind of like it that way, with
the whole strides idea, where data can be spread all over memory space,
belonging to ndarray, not dtype.

-Mark

>>> This would require fixing the issue where ufunc inner loops can't
>>> actually access the dtype object, but we should fix that anyway :-).
>>>
>>
>> Certainly true!
>>
> Chuck
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
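
For illustration, the storage cost behind the alignment argument can be checked with an ordinary structured dtype. This is only a rough sketch under stated assumptions: np.maybe is a proposed name and does not exist in NumPy, and the field names "value" and "missing" are made up here to stand in for an element that carries one extra byte of missingness information.

```python
import numpy as np

# Hypothetical stand-in for a "maybe float64" element: the value plus one
# byte of missingness. np.maybe is not a real NumPy function; a structured
# dtype is used here only to show the per-element storage cost.
fields = [("value", np.float64), ("missing", np.uint8)]

packed = np.dtype(fields)               # no padding: 8 + 1 bytes
aligned = np.dtype(fields, align=True)  # padded the way a C struct would be

# The alternative: keep the mask in its own memory block.
n = 1000
data = np.zeros(n, dtype=np.float64)
mask = np.zeros(n, dtype=np.bool_)      # one byte per element
separate_bytes_per_element = (data.nbytes + mask.nbytes) // n

print("packed itemsize: ", packed.itemsize)
print("aligned itemsize:", aligned.itemsize)
print("separate mask, bytes per element:", separate_bytes_per_element)
```

The packed layout keeps 9 bytes per element but leaves the values unaligned; the aligned layout pads each element up to a multiple of the value's alignment (16 bytes on typical 64-bit platforms); the separate mask keeps the data aligned at 9 bytes per element in total.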