On Wed, Nov 2, 2011 at 6:37 PM, Nathaniel Smith <[email protected]> wrote:
> Hi again,
>
> Okay, here's my attempt at an *uncontroversial* email!
>
> Specifically, I think it'll be easier to talk about this NA stuff if
> we can establish some common ground, and easier for people to follow
> if the basic points of agreement are laid out in one place. So I'm
> going to try and summarize just the things that we can agree about.
>
> Note that right now I'm *only* talking about what kind of tools we
> want to give the user -- i.e., what kind of problems we are trying to
> solve. AFAICT we don't have as much consensus on implementation
> matters, and anyway it's hard to make implementation decisions without
> knowing what we're trying to accomplish.
>
> 1) I think we have consensus that there are (at least) two different
> possible ways of thinking about this problem, with somewhat different
> constituencies. Let's call these two concepts "MISSING data" and
> "IGNORED data".
>
> 2) I also think we have at least a rough consensus on what these
> concepts mean, and what their supporters want from them:
>
> MISSING data:
> - Conceptually, MISSINGness acts like a property of a datum --
> assigning MISSING to a location is like assigning any other value to
> that location
> - Ufuncs and other operations must propagate these values by default,
> and there must be an option to cause them to be ignored
> - Must be competitive with NaNs in terms of speed and memory usage (or
> else people will just use NaNs)
> - Compatibility with R is valuable
> - To avoid user confusion, ideally it should *not* be possible to
> 'unmask' a missing value, since this is inconsistent with the "missing
> value" metaphor (e.g., see Wes's comment about "leaky abstractions")
> - Possible useful extension: having different classes of missing
> values (similar to Stata)
> - Target audience: data analysis with missing data, neuroimaging,
> econometrics, former R users, ...
>
> IGNORED data:
> - Conceptually, IGNOREDness acts like a property of the array --
> toggling a location to be IGNORED is kind of vaguely similar to
> changing an array's shape
> - Ufuncs and other operations must ignore these values by default, and
> there doesn't really need to be a way to propagate them, even as an
> option (though it probably wouldn't hurt either)
> - Some memory overhead is inevitable and acceptable
> - Compatibility with R neither possible nor valuable
> - Ability to toggle the IGNORED state of a location is critical, and
> should be as convenient as possible
> - Possible useful extension: having not just different types of
> ignored values, but richer ways to combine them -- e.g., the example
> of combining astronomical images with some kind of associated
> per-pixel quality scores, where one might want the 'mask' to be not
> just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
> multi-byte integer) or even a float, and to allow these 'masks' to be
> combined in some more complex way than just logical_and.
> - Target audience: anyone who's already doing this kind of thing by
> hand using a second mask array + boolean indexing, former numpy.ma
> users, matplotlib, ...
>
> 3) And perhaps we can all agree that the biggest *un*resolved question
> is whether we want to:
> - emphasize the similarities between these two use cases and build a
> single interface that can handle both concepts, with some compromises
> - or, treat these as two mostly-separate features that can each become
> exactly what the respective constituency wants without compromise --
> but with some potential redundancy and extra code.
> Each approach has advantages and disadvantages.
>
> Does that seem like a fair summary? Anything more we can add? Most
> importantly, anything here that you disagree with? Did I summarize
> your needs well? Do you have a use case that you feel doesn't fit
> naturally into either category?
>
> [Also, I thought this might make the start of a good wiki page for
> people to reference during these discussions, but I don't seem to have
> edit rights. If other people agree, maybe someone could put it up, or
> give me access? My trac id is [email protected].]
>
> Thanks,
> -- Nathaniel
>

I want to pare this down even more. I think the above lists make too
many unneeded extrapolations.

MISSING data:
- Conceptually, MISSINGness acts like a property of a datum --
assigning MISSING to a location is like assigning any other value to
that location
- Ufuncs and other operations must propagate these values by default,
and there must be an option to cause them to be ignored
- Assigning MISSING is destructive
- Must be competitive with NaNs in terms of speed and memory usage (or
else people will just use NaNs)
- Target audience: data analysis with missing data, neuroimaging,
econometrics, former R users, ...
- Possible useful extension: having different classes of missing
values (similar to Stata)

IGNORED data:
- Conceptually, IGNOREDness acts like a property of the array --
toggling a location to be IGNORED is kind of vaguely similar to
changing an array's shape
- Ufuncs and other operations must ignore these values by default, and
there doesn't really need to be a way to propagate them, even as an
option (though it probably wouldn't hurt either)
- Assigning IGNORED is non-destructive
- Must be competitive with np.ma for speed and memory (or else users
would just use np.ma)
- Target audience: anyone who's already doing this kind of thing by
hand using a second mask array + boolean indexing, former numpy.ma
users, matplotlib, ...
- Possible useful extension: having not just different types of
ignored values, but richer ways to combine them -- e.g., the example
of combining astronomical images with some kind of associated
per-pixel quality scores, where one might want the 'mask' to be not
just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
multi-byte integer) or even a float, and to allow these 'masks' to be
combined in some more complex way than just logical_and.

Also, as a third-party module developer, I can tell you that having
separate and independent ways to detect "MISSING" and "IGNORED" values
would likely make support more difficult. Downstream code would greatly
benefit from a common (or easily combinable) method of identification.

Ben Root

P.S. - I took out the phrase "compatibility with R" not as a slight
against R, but because of the vagueness of the statement. Does it mean
raw binary data format compatibility? Some sort of ABI compatibility
(can R and Python call each other and pass data back and forth)?
Rather, I find the declaration of R users as a target audience *much*
more important, and it allows for more flexibility in achieving that
goal for both forms of data.
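
To make the two lists above a bit more concrete, here is a rough sketch
using tools that already exist in NumPy. It is only an approximation:
NaN stands in for MISSING and numpy.ma stands in for IGNORED, which is
an assumption for illustration and not what any proposal actually
specifies. The helper name `unavailable` is hypothetical; it just shows
what a single, combinable method of identification could look like for
downstream code.

```python
import numpy as np

# MISSING-like behaviour, with NaN as a stand-in (an assumption).
a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan                  # destructive: the old value 2.0 is gone
missing_sum = a.sum()          # propagates by default
missing_skip = np.nansum(a)    # explicit opt-in to ignoring the value

# IGNORED-like behaviour, with numpy.ma as a stand-in (an assumption).
b = np.ma.array([1.0, 2.0, 3.0])
b[1] = np.ma.masked            # non-destructive: the payload is kept
ignored_sum = b.sum()          # masked entries are skipped by default

# Toggle the location back on by installing an updated copy of the mask.
new_mask = np.ma.getmaskarray(b).copy()
new_mask[1] = False
b.mask = new_mask
restored_sum = b.sum()         # the original 2.0 is visible again


def unavailable(arr):
    """Hypothetical helper: True where arr is NaN or masked."""
    mask = np.ma.getmaskarray(arr)   # all False for a plain ndarray
    data = np.ma.getdata(arr)
    if np.issubdtype(data.dtype, np.floating):
        mask = mask | np.isnan(data)
    return mask


# One array carrying both kinds of "not available" at once.
c = np.ma.array([1.0, np.nan, 3.0], mask=[False, False, True])
both = unavailable(c)
```

With something like `unavailable`, a third-party function needs only
one check regardless of which flavour the caller used, which is the
kind of common identification I am asking for.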
