On 07/06/2011 02:46 PM, Matthew Brett wrote:
> Hi,
>
> Sorry, I hope you don't mind, I moved this to its own thread, trying
> to separate comments on the NA debate from the discussion yesterday.

I'm sorry.

> On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn
> <d.s.seljeb...@astro.uio.no>  wrote:
>> On 07/06/2011 02:05 PM, Matthew Brett wrote:
>>> Hi,
>>>
>>> Just for reference, I am using this as the latest version of the NEP -
>>> I hope it's current:
>>>
>>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>>
>>> I'm mostly relaying stuff I said, although generally (please do
>>> correct me if I am wrong) I am just re-expressing points that
>>> Nathaniel has already made in the alterNEP text and the emails.
>>>
>>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>>> <cjord...@uw.edu>    wrote:
>>> ...
>>>> Since Mark is only around Austin until early August, there's
>>>> also broad agreement that we need to get something done quickly.
>>>
>>> I think I might have missed that part of the discussion :)
>>>
>>> I feel the need to emphasize the centrality of the assertion by
>>> Nathaniel, and agreement by (at least) me, that the NA case (there
>>> really is no data) and the IGNORE case (there is data but I'm
>>> concealing it from you) are conceptually different, and come from
>>> different use-cases.
>>>
>>> The underlying disagreement returned many times to this fundamental
>>> difference between the NEP and alterNEP:
>>>
>>> In the NEP - by design - it is impossible to distinguish between
>>> np.NA and np.IGNORE.  The alterNEP insists you should be able to
>>> distinguish.
>>>
>>> Mark says something like "it's all missing data, there's no reason you
>>> should want to distinguish".  Nathaniel and I were saying "the two
>>> types of missing do have different use-cases, and it should be
>>> possible to distinguish.  You might want to choose to treat them the
>>> same, but you should be able to see what they are."
>>>
>>> I returned several times to this (original point by Nathaniel):
>>>
>>> a[3] = np.NA
>>>
>>> (what does this mean?   Am I altering the underlying array, or a mask?
>>>     How would I explain this to someone?)
>>>
>>> We confirmed that, in order to make it difficult to know what your NA
>>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>>> below the mask and b) prevent direct API access to the masking array.
>>> I described this as 'hobbling the API' and Mark thought of it as
>>> 'generic programming' (missing is always missing).
>>
>> Here's an HPC perspective...:
>>
>> If you, say, want to off-load array processing with a mask to some code
>> running on a GPU, you really can't have the GPU go through some NumPy
>> API. Or if you want to implement a masked array on a cluster with MPI,
>> you similarly really, really want raw access.
>>
>> At least I feel that the transparency of NumPy is a huge part of its
>> current success. Many people besides me spend half their time in
>> C/Fortran and half their time in Python.
>>
>> I tend to look at NumPy this way: Assume you have some data in memory
>> (possibly loaded by a C or Fortran library). (Almost) no matter how it
>> is allocated, ordered, packed, aligned -- there's a way to find strides
>> and dtypes to put a nice NumPy wrapper around it and use the memory
>> from Python.
>>
>> So, my view on Mark's NEP was: With a reasonable amount of flexibility
>> in how you decide to implement masking for your data, you can create a
>> NumPy wrapper that will understand that. Whether your Fortran library
>> exposes NAs in its 40GB buffer as bit patterns, or using a separate
>> mask, both will work.
>>
>> And IMO Mark's NEP comes rather close to this; you just need an
>> additional NEP later to give raw access to the implementation details,
>> once those are settled :-)
>
> I was a little puzzled as to what you were trying to say, but I
> suspect that's my ignorance about Numpy internals.
>
> Superficially, I would have assumed that making masked and
> bit-pattern NAs behave the same in numpy would take you away from the
> raw data, in the sense that you not only need the dtype, you also need
> the mask machinery, in order to know if you have an NA.   Later I
> realized that you probably weren't saying that.  So, just for my
> unhappy ignorance - how does the HPC perspective relate to the debate
> about "can / can't distinguish NA from ignore"?

I just commented on the "prevent direct API access to the masking array" 
part -- I'm hoping direct access by external code to the underlying 
implementation details will be allowed, at some point.
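
For a rough idea of the kind of transparency I mean: today's numpy.ma 
already lets external code reach both buffers directly. (Just a sketch 
with made-up data, not a comment on numpy.ma's semantics.)

import numpy as np

# A plain data buffer plus an explicit boolean mask, roughly what a
# C or Fortran library might hand us.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])

marr = np.ma.masked_array(data, mask=mask)

# Other code can still reach the raw buffers without going through any
# masking-aware API:
raw_data = np.asarray(marr.data)   # the underlying values, mask ignored
raw_mask = np.asarray(marr.mask)   # the mask itself, an ordinary array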

What I'm saying is that Mark's proposal is more flexible. Say for the 
sake of argument that I have two codes I need to interface with:

  - Library A is written in Fortran and uses a separate (explicit) mask 
array for NA

  - Library B runs on a GPU and uses a bit pattern for NA

Mark's proposal then comes closer to allowing me to wrap both codes 
using NumPy, since it supports both implementation mechanisms. Sure, it 
would need a separate NEP down the road to extend it, but it goes in the 
right direction for this to happen.

As for NA vs. IGNORE, I still think two types are too few. One should 
allow for 255 different NA values, each with user-defined behaviour. 
Again, Mark's proposal makes a good start on that, even if more work 
would be needed to make it happen.

I.e., in my perfect world I'd do something like this to wrap libraries A 
and B (Cython-ish pseudo-code):

def call_lib_A():
    ...
    # Library A fills a data buffer and an explicit mask buffer.
    lib_A_function(arraybuf, maskbuf, ...)
    # A user-defined NA type with its own behaviour; "behaviour" could
    # also be "zero" or "invalid".
    DOG_ATE_IT = np.NA("DOG_ATE_IT", value=42, behaviour="raise")
    # Map mask-byte values to NA types.
    missing_value_map = {0xAF: np.NA, 0x43: np.IGNORE, 0xF0: DOG_ATE_IT}
    result = np.PyArray_CreateArrayFromBufferWithMaskBuffer(
        arraybuf, maskbuf, missing_value_map, ...)
    return result

def call_lib_B():
    # Library B encodes NA directly in the data buffer as a bit pattern,
    # so there is no separate mask buffer to pass along.
    lib_B_function(arraybuf, ...)
    missing_value_patterns = {0xFFFFCACA: np.NA}
    result = np.PyArray_CreateArrayFromBufferWithBitPattern(
        arraybuf, missing_value_patterns, ...)
    return result
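
(For comparison, the bit-pattern case can at least be crudely 
approximated today by turning the sentinel into a numpy.ma mask; a 
sketch with a made-up uint32 sentinel, and of course with no notion of 
distinct NA types or behaviours:)

import numpy as np

SENTINEL = 0xFFFFCACA  # made-up bit pattern standing in for NA

# Pretend this uint32 buffer came back from library B.
raw = np.array([1, 2, SENTINEL, 4], dtype=np.uint32)

# Build a masked array by matching the sentinel; raw itself is left
# untouched and stays directly accessible.
wrapped = np.ma.masked_equal(raw, SENTINEL)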

Hope that is clearer. Again, my intention is not to suggest even more 
work at the present stage, just to point out some advantages of the 
general direction of Mark's proposal.

Dag Sverre