Re: [Numpy-discussion] Missing data again

Mark Wiebe Sat, 03 Mar 2012 13:46:34 -0800

On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant <tra...@continuum.io>wrote:


> <snip>
>


> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code.  There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data.   I'm sure there are other things as well that I'm not
> quite aware of yet.    However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
>

I thought I might chime in with some implementation-detail notes, as while
Travis has dug into the code, I'm still the person who knows it best.

A few particulars:
>
>        * the reduction operations need to default to "skipna" --- this is
> the most common use case which has been re-inforced again to me today by a
> new user to Python who is using masked arrays presently
>

This is a completely trivial change. I went with the default as I did
because it's what R, the primary inspiration for the NA design, does. We'll
have to be sure this is well-marked in the documentation about "NumPy NA
for R users".


>        * the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>

This is relatively easy. Probably the way to do it is with an
ndarray.maskna property. It could be in 1.7 if we really push. For the
multi-NA future, I think the NPY_MASK dtype, currently an alias for
NPY_UBYTE, would need to become its own dtype with separate .exposed and
.payload attributes.


>        * bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
>

I strongly wanted to do masks first, because of the greater generality and
because the bit-patterns would best be implemented sharing mask
implementation details. I still believe this was the correct choice, and it
set the stage for bit-patterns. It will be possible to make inner loops
that specialize for the default hard-coded bit-pattern dtypes. I paid very
careful attention in the design making sure high performance is possible
without significant rework. The immense scale of the required code changes
meant I couldn't actually implement high performance in the time frame.

The place I think this affects 1.7 the most is in the default choice for
what np.array([1.0, np.NA, 3.0]) and np.array([1, np.NA, 3]) mean. In 1.7,
both mean an NA-masked array. In 1.8, I can see a strong case that the
first should mean an NA-dtype, and the second an NA-masked array.

Also, here's a thought for the usability of NA-float64. As much as global
state is a bad idea, something which determines whether implicit float
dtypes are NA-float64 or float64 could help. In IPython, "pylab" mode would
default to float64, and "statlab" or "pystat" would default to NA-float64.
One way to write this might be:

>>> np.set_default_float(np.nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=nafloat64)
>>> np.set_default_float(np.float64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=float64)


>        * there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation
>           on the masks...
>

This is completely trivial to implement. Maybe
ndarray.view(maskna='ignore') is a reasonable way to spell direct access
without a mask.

Cheers,
Mark


> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why.    For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic.  On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
>  None-the-less, I'm an *applied* mathematician and am ultimately motivated
> by applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some
> of this in that document.   This will take several weeks (as PyCon is next
> week and I have a tutorial I'm giving there).    For now, I do not think
> 1.7 can be released unless the masked array is labeled *experimental*.
>
> Thanks,
>
> -Travis
>
>
>
>
>
>
>
>
>
>
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] Missing data again

Reply via email to