Hi,

On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn
<[email protected]> wrote:
> On 07/06/2011 02:46 PM, Matthew Brett wrote:
>> Hi,
>>
>> Sorry, I hope you don't mind, I moved this to its own thread, trying
>> to separate comments on the NA debate from the discussion yesterday.
>
> I'm sorry.
>
>> On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn
>> <[email protected]> wrote:
>>> On 07/06/2011 02:05 PM, Matthew Brett wrote:
>>>> Hi,
>>>>
>>>> Just for reference, I am using this as the latest version of the NEP -
>>>> I hope it's current:
>>>>
>>>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>>>
>>>> I'm mostly relaying stuff I said, although generally (please do
>>>> correct me if I am wrong) I am just re-expressing points that
>>>> Nathaniel has already made in the alterNEP text and the emails.
>>>>
>>>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>>>> <[email protected]> wrote:
>>>> ...
>>>>> Since Mark is only around Austin until early August, there's
>>>>> also broad agreement that we need to get something done quickly.
>>>>
>>>> I think I might have missed that part of the discussion :)
>>>>
>>>> I feel the need to emphasize the centrality of the assertion by
>>>> Nathaniel, and agreement by (at least) me, that the NA case (there
>>>> really is no data) and the IGNORE case (there is data but I'm
>>>> concealing it from you) are conceptually different, and come from
>>>> different use-cases.
>>>>
>>>> The underlying disagreement returned many times to this fundamental
>>>> difference between the NEP and alterNEP:
>>>>
>>>> In the NEP - by design - it is impossible to distinguish between na.NA
>>>> and na.IGNORE
>>>> The alterNEP insists you should be able to distinguish.
>>>>
>>>> Mark says something like "it's all missing data, there's no reason you
>>>> should want to distinguish".
>>>> Nathaniel and I were saying "the two
>>>> types of missing do have different use-cases, and it should be
>>>> possible to distinguish. You might want to choose to treat them the
>>>> same, but you should be able to see what they are."
>>>>
>>>> I returned several times to this (original point by Nathaniel):
>>>>
>>>> a[3] = np.NA
>>>>
>>>> (what does this mean? I am altering the underlying array, or a mask?
>>>> How would I explain this to someone?)
>>>>
>>>> We confirmed that, in order to make it difficult to know what your NA
>>>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>>>> below the mask and b) prevent direct API access to the masking array.
>>>> I described this as 'hobbling the API' and Mark thought of it as
>>>> 'generic programming' (missing is always missing).
>>>
>>> Here's an HPC perspective...:
>>>
>>> If you, say, want to off-load array processing with a mask to some code
>>> running on a GPU, you really can't have the GPU go through some NumPy
>>> API. Or if you want to implement a masked array on a cluster with MPI,
>>> you similarly really, really want raw access.
>>>
>>> At least I feel that the transparency of NumPy is a huge part of its
>>> current success. Many more than me spend half their time in C/Fortran
>>> and half their time in Python.
>>>
>>> I tend to look at NumPy this way: Assuming you have some data in memory
>>> (possibly loaded by a C or Fortran library). (Almost) no matter how it
>>> is allocated, ordered, packed, aligned -- there's a way to find strides
>>> and dtypes to put a nice NumPy wrapper around it and use the memory from
>>> Python.
>>>
>>> So, my view on Mark's NEP was: With a reasonable amount of flexibility
>>> in how you decided to implement masking for your data, you can create a
>>> NumPy wrapper that will understand that. Whether your Fortran library
>>> exposes NAs in its 40GB buffer as bit patterns, or using a separate
>>> mask, both will work.
>>>
>>> And IMO Mark's NEP comes rather close to this, you just need an
>>> additional NEP later to give raw details to the implementation details,
>>> once those are settled :-)
>>
>> I was a little puzzled as to what you were trying to say, but I
>> suspect that's my ignorance about Numpy internals.
>>
>> Superficially, I would have assumed that, making masked and
>> bit-pattern NAs behave the same in numpy, would take you away from the
>> raw data, in the sense that you not only need the dtype, you also need
>> the mask machinery, in order to know if you have an NA. Later I
>> realized that you probably weren't saying that. So, just for my
>> unhappy ignorance - how does the HPC perspective relate to debate
>> about "can / can't distinguish NA from ignore"?
>
> I just commented on the "prevent direct API access to the masking array"
> part -- I'm hoping direct access by external code to the underlying
> implementation details will be allowed, at some point.
>
> What I'm saying is that Mark's proposal is more flexible. Say for the
> sake of the argument that I have two codes I need to interface with:
>
> - Library A is written in Fortran and uses a separate (explicit) mask
> array for NA
>
> - Library B runs on a GPU and uses a bit pattern for NA
>
> Mark's proposal then comes closer to allowing me to wrap both codes
> using NumPy, since it supports both implementation mechanisms. Sure, it
> would need a separate NEP down the road to extend it, but it goes in the
> right direction for this to happen.
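
To make the two storage mechanisms in the quoted Library A / Library B
example concrete, here is a rough sketch. It is only an illustration
under stated assumptions: it uses the existing numpy.ma module and NaN
as stand-ins, not the API proposed in the NEP or the alterNEP, and the
library_a_* / library_b_* names are made up.

```python
import numpy as np

# Hypothetical "Library A" layout: values plus a separate boolean mask.
library_a_values = np.array([1.0, 2.0, 3.0, 4.0])
library_a_mask = np.array([False, False, True, False])  # True means "ignore"

# Hypothetical "Library B" layout: missing entries stored in the data itself.
# NaN stands in here for a dedicated NA bit pattern (an assumption).
library_b_values = np.array([1.0, 2.0, np.nan, 4.0])

# Wrap both so that downstream code treats missing entries the same way.
wrapped_a = np.ma.masked_array(library_a_values, mask=library_a_mask)
wrapped_b = np.ma.masked_invalid(library_b_values)

# Both reductions skip the missing element.
mean_a = wrapped_a.mean()
mean_b = wrapped_b.mean()

# Looking underneath: the masked layout still holds the concealed value,
# while the bit-pattern layout has no value left to recover.
hidden_a = wrapped_a.data[2]
hidden_b = wrapped_b.data[2]

print(mean_a, mean_b, hidden_a, hidden_b)
```

Both wrappers give the same mean, but only the masked one still has a
value under the missing entry, which is the distinction the question
below is about.
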

I'm sorry - honestly - maybe it's because I've just had lunch, but I
think I am not understanding something.  When you say "Mark's proposal
is more flexible" - more flexible than what?  I think we agree that:

* NA bitpatterns are good to have
* masks are good to have

and the discussion is about:

* should it be possible to distinguish between bitpatterns (NAs) and
  masks (IGNORE).

Are you saying that making it impossible to distinguish, at the numpy
level, is more flexible?

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
