On Wed, Jul 6, 2011 at 7:34 PM, Nathaniel Smith wrote:

> Well, everyone seems to like my first attempt at this so far, so I
> guess I'll really stick my foot in it now... here's my second miniNEP,
> which lays out a plan for handling dtype/bit-pattern-style NAs. I've
> stolen bits of text from both the NEP and the alterNEP for this, but
> since the focus is on nailing down the details, most of the content is
> new.
>
> There are many FIXMEs noted, where some decisions or more work is
> needed... the idea here is to lay out some specifics, so we can figure
> out if the idea will work and get the details right. So feedback is
> *very* welcome!
>
> Master version:
>  https://gist.github.com/1068264
>
> Current version for commenting:
>
> #######################################
> miniNEP 2: NA support via special dtypes
> #######################################
>
> To try and make more progress on the whole missing values/masked
> arrays/... debate, it seems useful to have a more technical discussion
> of the pieces which we *can* agree on. This is the second, which
> attempts to nail down the details of how NAs can be implemented using
> special dtypes.
>
> *****************
> Table of contents
> *****************
>
> .. contents::
>
> *********
> Rationale
> *********
>
> An ordinary value is something like an integer or a floating point
> number. A missing value is a placeholder for an ordinary value that is
> for some reason unavailable. For example, in working with statistical
> data, we often build tables in which each row represents one item, and
> each column represents properties of that item. For instance, we might
> take a group of people and for each one record height, age, education
> level, and income, and then stick these values into a table. But then
> we discover that our research assistant screwed up and forgot to
> record the age of one of our individuals. We could throw out the rest
> of their data as well, but this would be wasteful; even such an
> incomplete row is still perfectly usable for some analyses (e.g., we
> can compute the correlation of height and income). The traditional way
> to handle this would be to stick some particular meaningless value in
> for the missing data, e.g., recording this person's age as 0. But this
> is very error prone; we may later forget about these special values
> while running other analyses, and discover to our surprise that babies
> have higher incomes than teenagers. (In this case, the solution would
> be to just leave out all the items where we have no age recorded, but
> this isn't a general solution; many analyses require something more
> clever to handle missing values.) So instead of using an ordinary
> value like 0, we define a special "missing" value, written "NA" for
> "not available".
>
> There are several possible ways to represent such a value in memory.
> For instance, we could reserve a specific value (like 0, or a
> particular NaN, or the smallest negative integer) and then ensure that
> this value is treated specially by all arithmetic and other operations
> on our array. Another option would be to add an additional mask array
> next to our main array, use this to indicate which values should be
> treated as NA, and then extend our array operations to check this mask
> array whenever performing computations. Each implementation approach
> has various strengths and weaknesses, but here we focus on the former
> (value-based) approach exclusively and leave the possible addition of
> the latter to future discussion. The core advantages of this approach
> are (1) it adds no additional memory overhead, (2) it is
> straightforward to store and retrieve such arrays to disk using
> existing file storage formats, (3) it allows binary compatibility with
> R arrays including NA values, (4) it is compatible with the common
> practice of using NaN to indicate missingness when working with
> floating point numbers, (5) the dtype is already a place where "weird
> things can happen" -- there are a wide variety of dtypes that don't
> act like ordinary numbers (including structs, Python objects,
> fixed-length strings, ...), so code that accepts arbitrary numpy
> arrays already has to be prepared to handle these (even if only by
> checking for them and raising an error). Therefore adding yet more new
> dtypes has less impact on extension authors than if we change the
> ndarray object itself.
>
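
A minimal sketch of the value-based approach described above, assuming the smallest int32 is reserved as the NA bit pattern. The names `NA_INT32`, `is_na`, and `na_add` are hypothetical helpers for illustration only, not part of NumPy or of the proposal:

```python
import numpy as np

# Assumption: the smallest int32 is reserved as the NA bit pattern.
NA_INT32 = np.iinfo(np.int32).min


def is_na(arr):
    """Return a boolean array marking entries that hold the NA bit pattern."""
    return np.asarray(arr, dtype=np.int32) == NA_INT32


def na_add(a, b):
    """Elementwise a + b in which any NA input produces NA in the output."""
    a = np.asarray(a, dtype=np.int32)
    b = np.asarray(b, dtype=np.int32)
    missing = is_na(a) | is_na(b)
    # Replace NA slots with 0 first so the sentinel never enters the arithmetic.
    total = np.where(missing, 0, a) + np.where(missing, 0, b)
    return np.where(missing, NA_INT32, total).astype(np.int32)


ages = np.array([31, NA_INT32, 45], dtype=np.int32)
print(ages.nbytes)               # no extra storage beyond the int32 data itself
print(is_na(na_add(ages, 1)))    # the NA entry stays NA after the addition
```

The point of the sketch is only that the NA lives inside the data buffer, so every operation has to check for the reserved pattern instead of consulting a separate mask.
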
> The basic semantics of NA values are as follows. Like any other value,
> they must be supported by your array's dtype -- you can't store a
> floating point number in an array with dtype=int32, and you can't
> store an NA in it either. You need an array with dtype=NAint32 or
> something (exact syntax to be determined). Otherwise, NA values act
> exactly like any other values. In particular, you can apply arithmetic
> functions and so forth to them. By default, any function which takes
> an NA as an argument always returns an NA as well, regardless of the
> values of the other arguments. This ensures that if we try to compute
> the correlation of income with age, we will get "NA", meaning "given
> that some of the entries could be anything, the answer could be
> anything as well". This reminds us to spend a moment thinking about
> how we should rephrase our question to be more meaningful. And as a
> convenience, for those times when you do decide that you just want the
> correlation between the known ages and income, you can enable
> this behavior by adding a single argument to your function call.
>
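
The default-propagate, opt-in-skip behavior in the paragraph above could look roughly like this. This is a sketch under stated assumptions: `NA_INT32`, the scalar marker `NA`, the function `na_sum`, and the `skip_na` argument name are all hypothetical, since the proposal leaves the exact syntax open:

```python
import numpy as np

# Assumption: the smallest int32 is the in-array NA bit pattern.
NA_INT32 = np.iinfo(np.int32).min

# Hypothetical scalar marker returned when a result is "not available".
NA = object()


def na_sum(values, skip_na=False):
    """Sum int32 values; any NA makes the result NA unless skip_na is set."""
    values = np.asarray(values, dtype=np.int32)
    missing = values == NA_INT32
    if missing.any() and not skip_na:
        # Some inputs could be anything, so the answer could be anything.
        return NA
    # Accumulate in int64 so the illustration does not overflow.
    return int(values[~missing].sum(dtype=np.int64))


ages = np.array([31, NA_INT32, 45], dtype=np.int32)
print(na_sum(ages) is NA)           # default: the NA propagates
print(na_sum(ages, skip_na=True))   # opt in: sum only the known ages
```
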
> For floating point computations, NAs and NaNs have (almost?) identical
> behavior. But they represent different things -- a NaN marks the result
> of an invalid computation like 0/0, while an NA marks a value that is
> not available -- and distinguishing between them is useful because in
> some situations they should be treated differently. (For example, an
> imputation procedure should replace NAs with imputed values, but
> probably should leave NaNs alone.) And anyway, we can't use NaNs for
> integers, or strings, or booleans, so we need NA regardless, and once we
> have NA support for all these types, we might as well support it for
> floating point too for consistency.
>
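
One way to keep a floating point NA distinct from an ordinary NaN is to reserve "a particular NaN", as mentioned earlier. The sketch below assumes a NaN whose low 32 bits hold the payload 1954, a convention R is generally described as using for its real-valued NA; here it is only an illustrative choice, and `NA_BITS`, `bits_to_float`, `float_to_bits`, and `is_na` are hypothetical names:

```python
import math
import struct

# Assumption: NA is a NaN whose low 32 bits carry the payload 1954 (0x7A2).
NA_PAYLOAD = 1954
NA_BITS = 0x7FF0000000000000 | NA_PAYLOAD


def bits_to_float(bits):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def float_to_bits(x):
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_na(x):
    """True only for NaNs that carry the reserved NA payload."""
    return math.isnan(x) and (float_to_bits(x) & 0xFFFFFFFF) == NA_PAYLOAD


NA = bits_to_float(NA_BITS)
print(math.isnan(NA))        # NA is still a NaN as far as the hardware knows
print(is_na(NA))             # but it carries the reserved payload
print(is_na(float("nan")))   # an ordinary NaN does not
```

Because the NA is a real NaN, floating point arithmetic propagates it with no special-casing; only code that needs to tell the two apart, such as an imputation routine, has to inspect the payload.
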
> ****************
> General strategy
> ****************
>
> Numpy already has a general mechanism for defining new dtypes and
> slotting them in so that they're supported by ndarrays, by the casting
> machinery, by ufuncs, and so on. In principle, we could implement
>

Well, actually not in any useful sense; take a look at what Mark went
through for the half floats. There is a reason the NEP went with
parametrized dtypes and masks. But we would sure welcome a plan and code to
make it true; it is one of the areas that could really use improvement.

<snip>

Chuck
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
