Mark Wiebe writes:
> The design that's forming is a combination of:
> * Solve the missing data problem
> * My ideas of what a good solution looks like:
>   * applies to all NumPy dtypes in a fully general way
>   * high-performance, low overhead where possible
>   * makes the C-level implementation of NumPy nicer to work with, not harder
>   * easy to use from Python for unskilled programmers
>   * easy to use more powerful functionality from Python for skilled
>     programmers
>   * satisfies all or most of the needs of the many users of arrays with a
>     "missing data" aspect to them

I would add here an efficient mechanism to reinterpret existing data with
different missing information (no copies of the backing array), although
I'm not sure whether this requires first-class citizenship or not.

> * All the feedback I'm getting from discussions on the list
[...]
> I've updated a section "Parameterized Data Type With NA Signal Values"
> in the NEP with an idea for how an NA bit pattern approach could
> coexist and work together with the mask-based approach. I think I've
> solved some of the generality and implementation obstacles, it would
> be great to get some feedback on that.

Some (obvious) thoughts about it:

* Trivial to store, as the missing property is encoded in the value itself.

* Third-party (non-Python) code needs some interface to interpret these
  without having to know the implementation details (although the
  interface is rather trivial).

* Data marked as missing loses its original value.

* Reinterpreting the same data (memory buffer) with different missing
  information requires either memory copies or separate mask arrays (see
  above).

So, while data types with NA signal values have their advantages for
simpler interaction with 3rd party code and for long-term storage, masks
will still be needed.

I think that deciding on the value of NA signal values boils down to this
question: should 3rd party code be able to interpret missing data
information stored in the separate mask array?

If the answer is no, then 3rd party code should be given a copy of the
data where the mask is merged into the ndarray data buffer (assuming the
original ndarray had a mask before being passed to the 3rd party code).
As by definition (?) the ndarray with a mask must retain the original
data, the result of the 3rd party code must be translated back into an
ndarray + mask.

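To make the "no" case concrete, here is a rough sketch using the existing
numpy.ma module, not the API proposed in the NEP. The arrays, the two
masks, and third_party_double are made up for illustration, and NaN stands
in for an NA signal value.

```python
import numpy as np

# One backing buffer, two different (illustrative) views of what is missing.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask_a = np.array([False, True, False, False])
mask_b = np.array([False, False, False, True])

# Masked views over the same buffer; numpy.ma does not copy an ndarray
# input by default, which is checked below with np.shares_memory.
view_a = np.ma.array(data, mask=mask_a)
view_b = np.ma.array(data, mask=mask_b)
shares_buffer = (np.shares_memory(view_a.data, data)
                 and np.shares_memory(view_b.data, data))


def third_party_double(buf):
    """Hypothetical external routine that only understands NaN as 'missing'."""
    return buf * 2.0


# Merge the mask into a *copy* of the data, using NaN as the signal value.
merged = view_a.filled(np.nan)
result = third_party_double(merged)

# Translate the result back into data + mask.
result_mask = np.isnan(result)
result_masked = np.ma.array(result, mask=result_mask)

# The original buffer still holds the value that was marked as missing.
original_value_kept = data[1] == 2.0
```

Note that the round trip only works here because NaN propagates through
the multiplication; a 3rd party routine that does not preserve the signal
value would lose the missing information.
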
If the answer is yes, then I think the NA signal values just add
unnecessary complexity, as the 3rd party code will already need to use
some numpy-specific API to handle missing data through the ndarray buffer
+ mask buffer.

This reminds me that if 3rd party code were to use the new iterator
interface, the interface could be adapted so that it returns only the
non-missing parts. For the sake of performance, this could be optional, so
that the default behaviour is to just iterate through non-missing data,
but an option can be used to iterate over all data and leave missing data
handling up to the 3rd party code.

My 2 cents,
    Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
