On Fri, Jun 24, 2011 at 4:09 PM, Benjamin Root <[email protected]> wrote:
>
>
> On Fri, Jun 24, 2011 at 10:40 AM, Mark Wiebe <[email protected]> wrote:
>
>> On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root <[email protected]> wrote:
>>
>>> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <[email protected]> wrote:
>>>
>>>> Sorry y'all, I'm just commenting bits by bits:
>>>>
>>>> "One key problem is a lack of orthogonality with other features, for
>>>> instance creating a masked array with physical quantities can't be done
>>>> because both are separate subclasses of ndarray. The only reasonable way to
>>>> deal with this is to move the mask into the core ndarray."
>>>>
>>>> Meh. I did try to make it easy to use masked arrays on top of
>>>> subclasses. There's even some tests in the suite to that effect
>>>> (test_subclassing). I'm not buying the argument.
>>>> About moving mask in the core ndarray: I had suggested back in the days
>>>> to have a mask flag/property built-in ndarrays (which would *really* have
>>>> simplified the game), but this suggestion was dismissed very quickly as
>>>> adding too much overload. I had to agree. I'm just a tad surprised the wind
>>>> has changed on that matter.
>>>>
>>>>
>>>> "In the current masked array, calculations are done for the whole array,
>>>> then masks are patched up afterwards. This means that invalid calculations
>>>> sitting in masked elements can raise warnings or exceptions even though
>>>> they shouldn't, so the ufunc error handling mechanism can't be relied on."
>>>>
>>>> Well, there's a reason for that. Initially, I tried to guess what the
>>>> mask of the output should be from the mask of the inputs, the objective
>>>> being to avoid getting NaNs in the C array. That was easy in most cases,
>>>> but it turned out it wasn't always possible (the `power` one caused me a
>>>> lot of issues, if I recall correctly). So, for performance issues (to avoid
>>>> a lot of expensive tests), I fell back on the old concept of "compute them
>>>> all, they'll be sorted afterwards".
>>>> Of course, that's rather clumsy an approach. But it works not too badly
>>>> when in pure Python. No doubt that a proper C implementation would work
>>>> faster.
>>>> Oh, about using NaNs for invalid data ? Well, can't work with integers.
>>>>
>>>> `mask` property:
>>>> Nothing to add to it. It's basically what we have now (except for the
>>>> opposite convention).
>>>>
>>>> Working with masked values:
>>>> I recall some strong points back in the days for not using None to
>>>> represent missing values...
>>>> Adding a maskedstr argument to array2string ? Mmh... I prefer a global
>>>> flag like we have now.
>>>>
>>>> Design questions:
>>>> Adding `masked` or whatever we call it to a number/array should result
>>>> in masked/a fully masked array, period. That way, we can have an idea that
>>>> something was wrong with the initial dataset.
>>>> hardmask: I never used the feature myself. I wonder if anyone did.
>>>> Still, it's a nice idea...
>>>>
>>>
>>> As a heavy masked_array user, I regret not being able to participate more
>>> in this discussion as I am madly cranking out matplotlib code. I would like
>>> to say that I have always seen masked arrays as being the "next step up"
>>> from using arrays with NaNs. The hardmask/softmask/sharedmask concepts
>>> are powerful, and I don't think they have yet been exploited to their
>>> fullest potential.
>>>
>>
>> Do you have some examples where hardmask or sharedmask are being used? I
>> like the idea of using a hardmask array as the return value for boolean
>> indexing, but some more use cases would be nice.
>>
>>
>
> At one point I did have something for soft/hard masks, but I think my final
> implementation went a different direction. I would have to look around. I
> do have a good use-case for soft masks. For a given dataset, I wanted to
> produce several pcolors highlighting different regions.
> A soft mask provided me a quick-n-easy way to change the mask without
> having to produce many copies of the original data.
>

That sounds cool, matplotlib will be a good place to do test modifications
while I'm doing the implementation.

>>> Masks are (relatively) easy when dealing with element-by-element operations
>>> that produce an array of the same shape (or at least the same number of
>>> elements in the case of reshape and transpose). What gets difficult is for
>>> reductions such as sum or max, etc. Then you get into the weirder cases
>>> such as unwrap and gradients that I brought up recently. I am not sure how
>>> to address this, but I am not a fan of the idea of adding yet another
>>> parameter to the ufuncs to determine what to do for filling in a mask.
>>>
>>
>> It looks like in R there is a parameter called na.rm=T/F, which basically
>> means "remove NAs before doing the computation". This approach seems good to
>> me for reduction operations.
>>
>>
> Just to throw out some examples where these settings really do not make
> much sense. For gradients and unwrap, maybe you want to skip na's, but
> still record the number of points you are skipping or maybe the points at
> na-boundaries become na's themselves. Are we going to have something for
> each one of these possibilities? Of course, this isn't even very well dealt
> with in masked arrays right now.
>

Yeah, for some functions dealing with NA values will need individual
per-function care. Probably they should raise by default until NA support is
implemented for them.

> Another example of how we use masks in matplotlib is in pcolor(). We have
> to combine the possible masks of X, Y, and V in both the x and y directions
> to find the final mask to use for the final output result (because each
> facet needs valid data at each corner).
> Having a soft-mask implementation allows one to create a temporary mask to
> use for the operation, and to share that mask across all the input data,
> but then let the data structures retain their original masks when done.
>

I will look at the implementation.

>>> Also, just to make things messier, there is an incomplete feature that was
>>> made for record arrays with regards to masking. The idea was to allow for
>>> element-by-element masking, but also allow for row-by-row (or was it
>>> column-by-column?) masking. I thought it was a neat feature, and it is too
>>> bad that it was not finished.
>>>
>>
>> I put this in my design, I think this would be useful too. I would call it
>> field by field, though many people like thinking of the struct dtype fields
>> as columns.
>>
>>
> Fields are fine. I have found that there is no real consistency with how
> professionals refer to their rows and columns as "records" and "fields". I
> learned data-handling from working on databases, but my naming convention
> often clashes with some of my committee members who come from a stats
> background.
>

I prefer considering them like C structs, which is why I've started calling
them "struct dtypes". That name is also shorter than "structured dtypes".

>>> Anyway, my opinion is that a mask should be True for a value that needs
>>> to be hidden. Do not change this convention. People coming into python
>>> already have to change code, a simple bit flip for them should be fine.
>>> Breaking existing python code is worse.
>>>
>>
>> I'm now thinking the mask needs to be pushed away into the background to
>> where it becomes an unimportant implementation detail of the system. It
>> deserves a long cumbersome name like "validitymask", and then the system can
>> use something close to R's approach with an NA-like singleton for most
>> operations.
>>
>
> Don't lose sight that we are really talking about two orthogonal (albeit,
> seemingly similar) concepts. "missing" data and "ambiguous" data. Both of
> these tools need to be at the forefront and the distinction needs to be made
> clear to the users so that they know which one they need in what situation.
> I think hiding masks is a bad idea. I want numpy to be *better* than R by
> offering both features in a clear, non-conflicting manner.
>

That sounds good to me, we'll have to go through several design iterations
to shake out the details.

> On a note somewhat similar to what I was pointing out earlier with regards
> to soft masks. One thing that is very nice about masked_arrays is that I can
> at any time turn a regular numpy array into a masked array without paying a
> penalty of having to re-assign the data. Just need to make a separate mask
> object.
>

I believe my design sufficiently allows for this.

> This is different from how one would operate with a na-dtype approach, where
> converting an array with a regular dtype into a na-dtype array would require
> a copy. However, with proper dtype-handling, this may not be of much
> concern (non-na-dtype + na-dtype --> na-dtype, much like how
> int + float --> float). Also loading functions could be told to cast to a
> na-dtype, which would then result in an array that is ready "out-of-the-box"
> as opposed to casting the masked array after the creation of the regular
> ndarray from a function like np.loadtxt().
>

The syntax for casting each element of a struct dtype to a new struct dtype
with all na-dtypes would be clumsy at first, there's a bunch of things that
would have to be figured out to make that all play nicely.

> Again, there are pros and cons either way and I see them as very orthogonal
> and complementary. Heck, I could even imagine situations where one might
> want a mask over an array with a na-dtype.
>

Maybe, but I'm kind of liking the idea of both of these use cases being
handled by the same underlying mechanism. I've updated the NEP, and will let
it bake for a bit I think.

-Mark

>
> Ben Root
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
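
For reference, the no-copy conversion and the per-region soft masks that Ben
describes in this thread can be sketched with the existing numpy.ma module.
This is only an illustration, not anything from the NEP: the array contents
and the names hide_low and hide_high are made up, and it assumes np.ma.array
is left at its default copy=False.

```python
import numpy as np

# Plain ndarray standing in for the original dataset (illustrative values).
data = np.arange(12, dtype=float).reshape(3, 4)

# Two masked arrays wrapping the same buffer, each with its own separate mask.
# True in the mask means "hidden", matching the current numpy.ma convention.
hide_low = np.ma.array(data, mask=data < 6)
hide_high = np.ma.array(data, mask=data >= 6)

# Neither wrapper copied the data; only the boolean masks are new objects.
shares = np.shares_memory(data, hide_low.data) and np.shares_memory(
    data, hide_high.data
)
print(shares)

# Reductions skip the hidden elements of each wrapper.
print(hide_low.sum(), hide_high.sum())
```

Because both wrappers see the same buffer, an in-place write to data shows up
through each of them wherever that wrapper's mask is False.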
