On Wed, Jul 6, 2011 at 7:34 PM, Nathaniel Smith <[email protected]> wrote:
> Well, everyone seems to like my first attempt at this so far, so I
> guess I'll really stick my foot in it now... here's my second miniNEP,
> which lays out a plan for handling dtype/bit-pattern-style NAs. I've
> stolen bits of text from both the NEP and the alterNEP for this, but
> since the focus is on nailing down the details, most of the content is
> new.
>
> There are many FIXME's noted, where some decisions or more work is
> needed... the idea here is to lay out some specifics, so we can figure
> out if the idea will work and get the details right. So feedback is
> *very* welcome!
>
> Master version:
> https://gist.github.com/1068264
>
> Current version for commenting:
>
> #######################################
> miniNEP 2: NA support via special dtypes
> #######################################
>
> To try and make more progress on the whole missing values/masked
> arrays/... debate, it seems useful to have a more technical discussion
> of the pieces which we *can* agree on. This is the second, which
> attempts to nail down the details of how NAs can be implemented using
> special dtype's.
>
> *****************
> Table of contents
> *****************
>
> .. contents::
>
> *********
> Rationale
> *********
>
> An ordinary value is something like an integer or a floating point
> number. A missing value is a placeholder for an ordinary value that is
> for some reason unavailable. For example, in working with statistical
> data, we often build tables in which each row represents one item, and
> each column represents properties of that item. For instance, we might
> take a group of people and for each one record height, age, education
> level, and income, and then stick these values into a table. But then
> we discover that our research assistant screwed up and forgot to
> record the age of one of our individuals.
> We could throw out the rest
> of their data as well, but this would be wasteful; even such an
> incomplete row is still perfectly usable for some analyses (e.g., we
> can compute the correlation of height and income). The traditional way
> to handle this would be to stick some particular meaningless value in
> for the missing data, e.g., recording this person's age as 0. But this
> is very error prone; we may later forget about these special values
> while running other analyses, and discover to our surprise that babies
> have higher incomes than teenagers. (In this case, the solution would
> be to just leave out all the items where we have no age recorded, but
> this isn't a general solution; many analyses require something more
> clever to handle missing values.) So instead of using an ordinary
> value like 0, we define a special "missing" value, written "NA" for
> "not available".
>
> There are several possible ways to represent such a value in memory.
> For instance, we could reserve a specific value (like 0, or a
> particular NaN, or the smallest negative integer) and then ensure that
> this value is treated specially by all arithmetic and other operations
> on our array. Another option would be to add an additional mask array
> next to our main array, use this to indicate which values should be
> treated as NA, and then extend our array operations to check this mask
> array whenever performing computations. Each implementation approach
> has various strengths and weaknesses, but here we focus on the former
> (value-based) approach exclusively and leave the possible addition of
> the latter to future discussion.
> The core advantages of this approach
> are (1) it adds no additional memory overhead, (2) it is
> straightforward to store and retrieve such arrays to disk using
> existing file storage formats, (3) it allows binary compatibility with
> R arrays including NA values, (4) it is compatible with the common
> practice of using NaN to indicate missingness when working with
> floating point numbers, (5) the dtype is already a place where `weird
> things can happen' -- there are a wide variety of dtypes that don't
> act like ordinary numbers (including structs, Python objects,
> fixed-length strings, ...), so code that accepts arbitrary numpy
> arrays already has to be prepared to handle these (even if only by
> checking for them and raising an error). Therefore adding yet more new
> dtypes has less impact on extension authors than if we change the
> ndarray object itself.
>
> The basic semantics of NA values are as follows. Like any other value,
> they must be supported by your array's dtype -- you can't store a
> floating point number in an array with dtype=int32, and you can't
> store an NA in it either. You need an array with dtype=NAint32 or
> something (exact syntax to be determined). Otherwise, NA values act
> exactly like any other values. In particular, you can apply arithmetic
> functions and so forth to them. By default, any function which takes
> an NA as an argument always returns an NA as well, regardless of the
> values of the other arguments. This ensures that if we try to compute
> the correlation of income with age, we will get "NA", meaning "given
> that some of the entries could be anything, the answer could be
> anything as well". This reminds us to spend a moment thinking about
> how we should rephrase our question to be more meaningful. And as a
> convenience for those times when you do decide that you just want the
> correlation between the known ages and income, then you can enable
> this behavior by adding a single argument to your function call.
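The semantics in the quoted paragraph above can be sketched in plain
Python rather than NumPy. This is only an illustration under stated
assumptions: the names NA_INT32, na_add and na_mean, and the skipna
argument, are hypothetical and not part of any NumPy API (the proposal
itself leaves the exact syntax to be determined). The sentinel chosen
here is the smallest int32, one of the candidates the text mentions.

```python
# Illustrative only: a pure-Python model of a bit-pattern NA for 32-bit ints.
# NA_INT32, na_add, na_mean and skipna are hypothetical names, not NumPy API.

NA_INT32 = -2**31  # reserved sentinel: the smallest negative int32


def na_add(a, b):
    """Add two values; any NA operand makes the result NA."""
    if a == NA_INT32 or b == NA_INT32:
        return NA_INT32
    return a + b


def na_mean(values, skipna=False):
    """Mean that propagates NA by default, or ignores NAs when skipna=True."""
    if skipna:
        # Opt-in behaviour: drop the missing entries, then compute as usual.
        values = [v for v in values if v != NA_INT32]
    elif NA_INT32 in values:
        # Default behaviour: one unknown input makes the answer unknown.
        return NA_INT32
    return sum(values) / len(values)


ages = [34, NA_INT32, 27, 51]
print(na_add(ages[0], ages[1]) == NA_INT32)  # NA propagates through addition
print(na_mean(ages) == NA_INT32)             # a single NA makes the mean NA
print(na_mean(ages, skipna=True))            # mean of the three known ages
```

Note that the sentinel costs no extra memory, but it does remove one
value from the range the integer type can represent.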
>
> For floating point computations, NAs and NaNs have (almost?) identical
> behavior. But they represent different things -- NaN an invalid
> computation like 0/0, NA a value that is not available -- and
> distinguishing between these things is useful because in some
> situations they should be treated differently. (For example, an
> imputation procedure should replace NAs with imputed values, but
> probably should leave NaNs alone.) And anyway, we can't use NaNs for
> integers, or strings, or booleans, so we need NA anyway, and once we
> have NA support for all these types, we might as well support it for
> floating point too for consistency.
>
> ****************
> General strategy
> ****************
>
> Numpy already has a general mechanism for defining new dtypes and
> slotting them in so that they're supported by ndarrays, by the casting
> machinery, by ufuncs, and so on. In principle, we could implement
>

Well, actually not in any useful sense; take a look at what Mark went
through for the half floats. There is a reason the NEP went with
parametrized dtypes and masks. But we would sure welcome a plan and
code to make it true; it is one of the areas that could really use
improvement.

<snip>

Chuck
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
