On Fri, Jul 8, 2011 at 5:04 PM, Nathaniel Smith <[email protected]> wrote:
> Hi Bruce,
>
> I'm replying on the list instead of on github, to make it easier for
> others to join in the discussion if they want. [For those joining in:
> this was a comment posted at https://gist.github.com/1068264 ]
>
> On Fri, Jul 8, 2011 at 10:36 AM, bsouthey wrote:
>> I presume missing float values could be addressed with one of the 'special' 
>> ranges such as 'Indeterminate' in IEEE 754 
>> (http://babbage.cs.qc.edu/IEEE-754/References.xhtml). The outcome should be 
>> determined by the IEEE special operations.
>
> Right. An IEEE 754 double has IIRC about 2^53 distinct bit-patterns
> that all mean "not a number". A few of these are used to signal
> different invalid operations:
>
> In [20]: hex(np.asarray([np.nan]).view(dtype=np.uint64)[0])
> Out[20]: '0x7ff8000000000000L'
> In [21]: hex(np.log([0]).view(dtype=np.uint64)[0])
> Out[21]: '0xfff0000000000000L'
> In [22]: hex(np.divide([0.], [0,]).view(dtype=np.uint64)[0])
> Out[22]: '0xfff8000000000000L'
>
> ...but that only accounts for, like, 10 of the 2^53 or something. The
> rest are simply unused. So what R does, and what we would do for
> dtype-style NAs, is just pick one of those (ideally the same one R
> uses), and declare that that is *not* not a number; it's NA.
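[For concreteness, a small Python sketch of that bit-pattern idea. The 0x7FF00000000007A2 value is, if I remember correctly, the NaN pattern R reserves for NA_real_ (payload 1954); the helper name here is made up:]

```python
import numpy as np

# Hypothetical illustration: reserve one specific NaN bit pattern as NA.
# 0x7FF00000000007A2 is (IIRC) what R uses for NA_real_.
NA_BITS = np.uint64(0x7FF00000000007A2)

def is_na(arr):
    """True only where the exact NA bit pattern occurs, not for other NaNs."""
    return arr.view(np.uint64) == NA_BITS

a = np.array([1.0, np.nan, 0.0])
a.view(np.uint64)[2] = NA_BITS   # plant an NA by writing its bit pattern

print(is_na(a))      # [False False  True] -- the ordinary NaN is not NA
print(np.isnan(a))   # [False  True  True] -- but NA still tests as a NaN
```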
>
>> So my real concern is handling integer arrays:
>> 1) How will you find where the missing values are in an array? If there is a 
>> variable that denotes missing values are present (NA_flags?) then do you 
>> have to duplicate code to avoid this searching when an array has no missing 
>> values?
>
> Each dtype has a bunch of C functions associated with it that say how
> to do comparisons, assignment, etc. In the miniNEP design, we add a
> new function to this list called 'isna', which every dtype that wants
> to support NAs has to define.

You start to lose me here, because you are adding memory overhead that
your miniNEP was not meant to require.

>
> Yes, this does mean that code which wants to treat NAs separately has
> to check for and call this function if it's present, but that seems to
> be inevitable... *all* of the dtype C functions are supposedly
> optional, so we have to check for them before calling them and do
> something sensible if they aren't defined. We could define a wrapper
> that calls the function if its defined, or else just fills the
> provided buffer with zeros (to mean "there are no NAs"), and then code
> which wanted to avoid a special case could use that. But in general we
> probably do want to handle arrays that might have NAs differently from
> arrays which don't have NAs, because if there are no NAs present then
> it's quicker to skip the handling altogether. That's true for any NA
> implementation.

A second problem is that we need memory for at least one new function
per dtype. We also end up with code duplication that has to be kept in sync.
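[To make sure I follow the wrapper idea, here is a rough Python model of it; the registry and names are invented for illustration, since the real design is a C function slot on the dtype:]

```python
import numpy as np

# Toy model (not the actual numpy C API) of the miniNEP 'isna' slot:
# each dtype may optionally register an isna function, and callers
# fall back to "no NAs" when it is absent.
ISNA_FUNCS = {}  # hypothetical registry: dtype -> isna callable

ISNA_FUNCS[np.dtype(np.float64)] = np.isnan  # NA-float: reuse the NaN test

def isna(arr):
    """Wrapper: call the dtype's isna if defined, else report no NAs."""
    func = ISNA_FUNCS.get(arr.dtype)
    if func is None:
        return np.zeros(arr.shape, dtype=bool)  # dtype defines no NAs
    return func(arr)

print(isna(np.array([1.0, np.nan])))   # [False  True]
print(isna(np.array([1, 2])))          # [False False] -- int defines no isna
```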

>
>> 2) What happens if a normal operation equates to that value: If you use 
>> max(np.int8), such as when adding 1 to an array with an element of 126 or 
>> when overflow occurs:
>>>>> np.arange(120,127, dtype=np.int8)+2
>> array([ 122,  123,  124,  125,  126,  127, -128], dtype=int8)
>> The -128 corresponds to the missing element but is the second to last 
>> element now missing? This is worse if the overflow is larger.
>
> Yeah, in the design as written, overflow (among other things) can
> create accidental NAs. Which kind of sucks. There are a few options:
>
> -- Just live with it.

Unfortunately that is unacceptable, and I have other choice words for it.

>
> -- We could add a flag like NPY_NA_AUTO_CHECK, and when this flag is
> set, the ufunc loop runs 'isna' on its output buffer before returning.
> If there are any NAs there that did not arise from NAs in the input,
> then it raises an error. (The reason we would want to make it a flag
> is that this checking is pointless for dtypes like NA-string, and
> mostly pointless for dtypes like NA-float.) Also, we'd only want to
> enable this if we were using the NPY_NA_AUTO_UFUNC ufunc-delegation
> logic, because if you registered a special ufunc loop *specifically
> for your NA-dtype*, then presumably it knows what it's doing. This
> would also allow such an NA-dtype-specific ufunc loop to return NAs on
> purpose if it wanted to.

This looks like masking to me. But my issue here is the complexity of
the check involved: ensuring that the calculation is correct probably
comes with a large performance penalty.
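[If I understand the NPY_NA_AUTO_CHECK proposal, it amounts to something like this Python sketch; the function and flag semantics here are my guess at the design, using -128 as the hypothetical stolen NA value for int8:]

```python
import numpy as np

NA_INT8 = np.int8(-128)  # hypothetical: the int8 value stolen for NA

def checked_add(a, b):
    """Toy model of NPY_NA_AUTO_CHECK for a bit-pattern NA-int8:
    any NA in the output that was not already present in the inputs
    must be accidental (e.g. from overflow), so raise instead."""
    out = a + b                                 # ordinary ufunc loop
    in_na = (a == NA_INT8) | (b == NA_INT8)
    accidental = (out == NA_INT8) & ~in_na
    if accidental.any():
        raise OverflowError("ufunc produced an accidental NA")
    return out

a = np.arange(120, 127, dtype=np.int8)
b = np.full_like(a, 2)
try:
    checked_add(a, b)        # 126 + 2 wraps around to -128, i.e. "NA"
except OverflowError as exc:
    print(exc)
```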

>
> -- Use a dtype that adds a separate flag next to the actual integer to
> indicate NA-ness, instead of stealing one of the integer's values. So
> your NA-int8 would actually be 2 bytes, where the first byte was 1 to
> indicate NA, or 0 to indicate that the second byte contains an actual
> int8. If you do this with larger integers, say an int32, then you have
> a choice: you could store your int32 in 8 bytes, in which case
> arithmetic etc. is fast, but you waste a bit of memory. Or you could
> store your int32 in 5 bytes, in which case arithmetic etc. become
> somewhat slower, but you don't waste any memory. (This latter case
> would basically be like using an unaligned or byteswapped array in
> current numpy, in terms of mechanisms and speed.)

But avoiding any increase in memory was one of the benefits of this
miniNEP. It really doesn't matter which integer size you use, because
you still have the same problem. Also, people choose int8 (or whatever)
deliberately, for example due to memory constraints.
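[For reference, the "separate flag byte" variant can already be mocked up with a structured dtype; the field names here are invented, and this is the 2-bytes-per-int8 layout that doubles the memory I am objecting to:]

```python
import numpy as np

# Sketch of the flag-plus-value layout for an NA-int8 using a
# structured dtype (field names invented for illustration).
na_int8 = np.dtype([('isna', np.uint8), ('value', np.int8)])

a = np.zeros(3, dtype=na_int8)
a['value'] = [126, 127, 0]
a[2] = (1, 0)                  # mark element 2 as NA via the flag byte

print(a.itemsize)              # 2 -- double the memory of a plain int8
print(a['isna'].astype(bool))  # [False False  True]

# The full int8 range stays usable, and overflow in the value field
# can no longer create an accidental NA, since the flag is untouched:
wrapped = a['value'] + np.int8(2)
```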

>
> -- Nothing in this design rules out a second implementation of NAs
> based on masking. Personally, as you know, I'm not a big fan, but if
> it were added anyway, then you could use that for your integers as
> well.
>
> A related issue is, of the many ways we *can* do integer NA-dtype,
> which one *should* we do by default. I don't have a strong opinion,
> really; I haven't heard anyone say that they have huge quantities of
> integer-plus-NA data that they want to manipulate and
> memory/speed/allowing the full range of values are all really
> important. (Maybe that's you?) In the design as written, they're all
> pretty trivial to implement (you just tweak a few magic numbers in the
> dtype structure), and probably we should support all of them via
> more-or-less exotic invocations of np.withNA. (E.g.,
> 'np.withNA(np.int32, useflag=True, flagsize=1)' to get a 5-byte
> int32.)

I disagree with the comment that this is 'pretty trivial to
implement'. I do not think it is trivial to implement with acceptable
performance and memory costs.


>
> ...I kind of like that NPY_NA_AUTO_CHECK idea, it's pretty clean and
> would definitely make things safer. I think I'll add it.
>
> -- Nathaniel

I am being difficult, as I do agree with much of the underlying idea.
But I want something that works with acceptable performance and memory
usage (there should be only a minor penalty for having missing
elements compared to none). I do not find it acceptable if A.dot(B) is
slower than first creating an array without NAs, as in C = A.noNA();
C.dot(B). Thus, to me, an API alone is insufficient to address that.


Bruce
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
