On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett <matthew.br...@gmail.com> wrote:

> Hi,
>
> Just for reference, I am using this as the latest version of the NEP -
> I hope it's current:
>
> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>
> I'm mostly relaying stuff I said, although generally (please do
> correct me if I am wrong) I am just re-expressing points that
> Nathaniel has already made in the alterNEP text and the emails.
>
> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
> <cjord...@uw.edu> wrote:
> ...
> > Since we only have Mark is only around Austin until early August, there's
> > also broad agreement that we need to get something done quickly.
>
> I think I might have missed that part of the discussion :)
>

I think that was mentioned by Travis right before he had to leave for
another meeting, which might have been after you'd disconnected. As a
member of the numpy community, Travis wants something that is broadly
applicable and widely adopted. But as Mark's employer, his concern is to
get more complete and coherent missing data functionality implemented in
numpy while Mark is still at Enthought, for use in the problems Enthought
and statisticians commonly encounter, if nothing else.

> I feel the need to emphasize the centrality of the assertion by
> Nathaniel, and agreement by (at least) me, that the NA case (there
> really is no data) and the IGNORE case (there is data but I'm
> concealing it from you) are conceptually different, and come from
> different use-cases.
>
> The underlying disagreement returned many times to this fundamental
> difference between the NEP and alterNEP:
>
> In the NEP - by design - it is impossible to distinguish between na.NA
> and na.IGNORE
> The alterNEP insists you should be able to distinguish.
>
> Mark says something like "it's all missing data, there's no reason you
> should want to distinguish". Nathaniel and I were saying "the two
> types of missing do have different use-cases, and it should be
> possible to distinguish. You might want to chose to treat them the
> same, but you should be able to see what they are.".
>
> I returned several times to this (original point by Nathaniel):
>
> a[3] = np.NA
>
> (what does this mean? I am altering the underlying array, or a mask?
> How would I explain this to someone?)
>
> We confirmed that, in order to make it difficult to know what your NA
> is (masked or bit-pattern), Mark has to a) hinder access to the data
> below the mask and b) prevent direct API access to the masking array.
> I described this as 'hobbling the API' and Mark thought of it as
> 'generic programming' (missing is always missing).
>
> I asserted that explaining NA to people would be easier if ``a[3] =
> np.NA`` was direct assignment and altered the array.
>
> > BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
> > -----------------------------------------
> > The current NEP proposes both mask and bit pattern implementations for
> > missing data. I use the terms bit pattern and parameterized dtype
> > interchangeably, since the parameterized dtype will use a bit pattern for
> > its implementation. The two implementations will support the same
> > functionality with respect to NA, and the implementation details will be
> > largely invisible to the user. Their differences are in the 'extra' features
> > each supports.
> >
> > Two common questions were:
> > 1. Why make two implementations of missing data: one with masks and the
> > other with parameterized dtypes?
> > 2. Why does the implementation using masks have higher priority?
> > The answers are:
> > 1. The mask implementation is more general and easier to implement and
> > maintain. The bit pattern implementation saves memory, makes
> > interoperability easier, and makes ABI (Application Binary Interface)
> > compatibility easier. Since each has different strengths, the argument is
> > both should be implemented.
> > 2. The implementation for the parameterized dtypes will rely on the
> > implementation using a mask.
> >
> > NA VS. IGNORE
> > -------------
> > A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in aNEP
> > sense and NA in NEP sense. With NA, there is a clear notion of how NA
> > propagates through all basic numpy operations. (e.g., 3+NA=NA and log(NA) =
> > NA, while NA | True = True.) IGNORE is separate from NA, with different
> > interpretations depending on the use case.
> > IGNORE could mean:
> > 1. Data that is being temporarily ignored. e.g., a possible outlier that is
> > temporarily being removed from consideration.
> > 2. Data that cannot exist. e.g., a matrix representing a grid of water
> > depths for a lake. Since the lake isn't square, some entries will represent
> > land, and so depth will be a meaningless concept for those entries.
> > 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
> > 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
> > leaves open how [1, 2, IGNORE] + [3 , 4] should behave.
> > Because of these different uses of IGNORE, it doesn't have as clear a
> > theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
> > or IGNORE | True?)
>
> I don't remember this bit of the discussion, but I see from current
> masked arrays that IGNORE is treated as the identity, so:
>
> IGNORE + 3 = 3
> IGNORE * 3 = 3
>

I'd mentioned at the top of my summary that some of the concrete examples
weren't talked about, even though the ideas were. The fact that IGNORE
doesn't have a computational model behind it was mentioned briefly, though
it wasn't expanded on.

If we follow those rules for IGNORE in all computations, we sometimes get
weird output. For example:

[ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]

(where * is matrix multiply and not * with broadcasting). Or should that
sort of operation throw an error?

> > But several of the discussants thought the use cases for IGNORE were very
> > compelling. Specifically, they wanted to be able to use IGNORE's and NA's
> > simultaneously while still being able to differentiate between them. So, for
> > example, being able to designate some data as IGNORE while still able to
> > determine which data was NA but not IGNORE. The current NEP does not allow
> > for this directly.
>
> I think we discovered that the current NEP is designed to prevent us
> distinguishing between these cases.
>
> I was asking what it was about the implementation (as opposed to the
> API) that influenced the decision to make masked and bit-pattern
> missing data appear to be identical. I left the conversation before
> the end, but up until that point, had failed to understand.
>
> See you,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
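
To make the matrix example above concrete, here is a minimal sketch in plain Python of the "IGNORE is the identity" rule (IGNORE + 3 = 3, IGNORE * 3 = 3) applied term by term inside a matrix-vector product. The IGNORE sentinel and the helpers add, mul, matvec_identity and matvec_skip are hypothetical names made up for this illustration; none of them is numpy, NEP or alterNEP API. The second function is just one alternative reading (drop the whole term), included only to show that the answer depends on which rule you pick.

```python
# Hypothetical sentinel for illustration only; not a NumPy object.
class _Ignore:
    def __repr__(self):
        return "IGNORE"

IGNORE = _Ignore()

def add(x, y):
    """Identity rule for addition: IGNORE + y = y."""
    if x is IGNORE:
        return y
    if y is IGNORE:
        return x
    return x + y

def mul(x, y):
    """Identity rule for multiplication: IGNORE * y = y."""
    if x is IGNORE:
        return y
    if y is IGNORE:
        return x
    return x * y

def matvec_identity(matrix, vector):
    """Matrix-vector product with the identity rule applied to every term."""
    result = []
    for row in matrix:
        total = IGNORE  # empty running sum
        for a, b in zip(row, vector):
            total = add(total, mul(a, b))
        result.append(total)
    return result

def matvec_skip(matrix, vector):
    """Assumed alternative: drop any term that involves IGNORE."""
    result = []
    for row in matrix:
        total = 0
        for a, b in zip(row, vector):
            if a is IGNORE or b is IGNORE:
                continue
            total += a * b
        result.append(total)
    return result

m = [[1, 2], [3, 4]]
v = [IGNORE, 7]
identity_result = matvec_identity(m, v)
skip_result = matvec_skip(m, v)
print(identity_result)  # [15, 31]
print(skip_result)      # [14, 28]
```

Under the identity rule the first row is 1*IGNORE + 2*7 = 1 + 14 = 15, which is where the odd-looking [15, 31] comes from; dropping the ignored term instead gives [14, 28]. Neither is obviously "the" right answer, which is the point about IGNORE lacking a computational model.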