On Wed, Jun 29, 2011 at 11:53 AM, Mark Wiebe <[email protected]> wrote:
> On Tue, Jun 28, 2011 at 7:34 AM, Lluís <[email protected]> wrote:
>
>> Mark Wiebe writes:
>> > The design that's forming is a combination of:
>>
>> > * Solve the missing data problem
>> > * My ideas of what a good solution looks like:
>> > * applies to all NumPy dtypes in a fully general way
>> > * high-performance, low overhead where possible
>> > * makes the C-level implementation of NumPy nicer to work with, not
>> >   harder
>> > * easy to use from Python for unskilled programmers
>> > * easy to use more powerful functionality from Python for skilled
>> >   programmers
>> > * satisfies all or most of the needs of the many users of arrays with
>> >   a "missing data" aspect to them
>>
>> I would add here an efficient mechanism to reinterpret existing data with
>> different missing information (no copies of the backing array).
>>
>> Although I'm not sure whether this requires first-class citizenship or
>> not.
>>
>
> I'm calling this idea "masking semantics" generally.
>
>> > * All the feedback I'm getting from discussions on the list
>> [...]
>> > I've updated a section "Parameterized Data Type With NA Signal Values"
>> > in the NEP with an idea for how an NA bit pattern approach could
>> > coexist and work together with the mask-based approach. I think I've
>> > solved some of the generality and implementation obstacles, it would
>> > be great to get some feedback on that.
>>
>> Some (obvious) thoughts about it:
>>
>> * Trivial to store, as the missing property is encoded in the value
>>   itself.
>> * Third-party (non-Python) code needs some interface to interpret these
>>   without having to know the implementation details (although the
>>   interface is rather trivial).
>> * Data marked as missing loses its original value.
>> * Reinterpreting the same data (memory buffer) with different missing
>>   information requires either memory copies or separate mask arrays (see
>>   above)
>>
>> So, while it (data types with NA signal values) has its advantages on a
>> simpler interaction with 3rd party code and during long-term storage,
>> masks will still be needed.
>>
>> I think that deciding on the value of NA signal values boils down to
>> this question: should 3rd party code be able to interpret missing data
>> information stored in the separate mask array?
>>
>
> I'm tossing around some variations of ideas using the iterator to provide a
> buffered mask-based interface that works uniformly with both masked arrays
> and NA dtypes. This way 3rd party C code only needs to implement one missing
> data mechanism to fully support both of NumPy's missing data mechanisms.
>
>

;) Also, it avoids a horrible mass of code.

Chuck
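
For readers following the thread, here is a rough sketch of the trade-off
discussed above: a separate mask lets the same buffer be reinterpreted with
different missing information and keeps the original values, while a signal
value overwrites the element it marks. This is only an illustration under
stated assumptions. It uses the existing numpy.ma module as a stand-in for the
mask-based approach and NaN as a stand-in for an NA bit pattern; neither is
the design proposed in the NEP, and the variable names are made up.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Mask-based stand-in: two masked views over the same buffer, each with
# its own idea of which elements are missing. No copy of `data` is made.
view_a = np.ma.array(data, mask=[False, True, False, False])
view_b = np.ma.array(data, mask=[False, False, True, False])

shares_buffer = (np.shares_memory(view_a.data, data)
                 and np.shares_memory(view_b.data, data))

sum_a = view_a.sum()  # skips data[1]
sum_b = view_b.sum()  # skips data[2]

# Signal-value stand-in (assumption: NaN plays the role of NA).
# Marking an element as missing overwrites its original value, so a
# second interpretation of the data needs its own copy.
signal = data.copy()
signal[1] = np.nan
sum_signal = np.nansum(signal)
```

The masked values are still present in `data` afterwards, whereas
`signal[1]` no longer holds 2.0.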
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
