Hi,

On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith <n...@pobox.com> wrote:
...
> If we think that the memory overhead for floating point types is too
> high, it would be easy to add a special case where maybe(float) used a
> distinguished NaN instead of a separate boolean. The extra complexity
> would be isolated to the 'maybe' dtype's inner loop functions, and
> transparent to the Python level. (Implementing a similar optimization
> for the masking approach would be really nasty.) This would change the
> overhead comparison to 0% versus 12.5% in favor of the dtype approach.
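For concreteness, here is a rough Python-level sketch of the
distinguished-NaN idea Nathaniel describes. The bit pattern and the
is_missing/set_missing helpers below are invented for illustration and
are not part of NumPy's API; in a real maybe(float) dtype this logic
would live in the C inner loops rather than in Python.

    import numpy as np

    # Reserve one specific quiet-NaN bit pattern (arbitrary payload chosen
    # here) to mean "missing", so no separate boolean mask array is needed.
    MISSING_BITS = np.uint64(0x7FF8DEADBEEF0000)

    def is_missing(arr):
        # NaN != NaN under float comparison, so compare raw bit patterns.
        return arr.view(np.uint64) == MISSING_BITS

    def set_missing(arr, where):
        arr.view(np.uint64)[where] = MISSING_BITS

    a = np.arange(4, dtype=np.float64)
    set_missing(a, [1, 3])
    print(is_missing(a))  # [False  True False  True]
    print(np.isnan(a))    # the missing entries still look like ordinary NaNs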
Can I take this chance to ask Mark a bit more about the problems he sees
for dtypes with missing values - that is, having np.float64_with_missing
and np.int32_with_missing type dtypes?

I see in your NEP you say:

'The trouble with this approach is that it requires a large amount of
special case code in each data type, and writing a new data type
supporting missing data requires defining a mechanism for a special
signal value which may not be possible in general.'

Just to be clear, you are saying that, for each dtype, there needs to be
some code doing:

    missing_value = dtype.missing_value

and then, in loops:

    if val[here] == missing_value:
        do_something()

and that the fact that 'missing_value' could be of any type would make
the code more complicated than the current case, where the mask is
always bools, or something like that?

Nathaniel's point about reducing the storage needed for the mask to zero
is surely significant if we want numpy to be the best choice for big
data.

You mention that it would be good to allow masking for any new dtype -
is that a practical problem? I mean, how many people will in fact have
the combination of a) need of masking, b) need of a custom dtype, and
c) lack of time or expertise to implement masking for that type?

Thanks a lot for the proposal and the discussion,

Matthew
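To make the question concrete, here is a rough sketch (plain Python
rather than the C inner-loop code a real dtype would need) of the kind
of per-dtype special casing described above. The INT32_MISSING sentinel
and the function are invented for illustration; an integer dtype also
shows why a special signal value 'may not be possible in general' - it
has to be taken out of the dtype's ordinary value range.

    import numpy as np

    # A float dtype can use NaN as its signal value, but an integer dtype
    # has to steal a value from its ordinary range - here INT32_MIN,
    # purely as an example choice.
    INT32_MISSING = np.int32(np.iinfo(np.int32).min)

    def sum_skipping_missing_int32(values):
        # The inner loop must compare every element against the
        # dtype-specific sentinel; each dtype needs its own sentinel
        # and its own check.
        total = 0
        for v in values:
            if v == INT32_MISSING:
                continue  # treat as missing
            total += v
        return total

    a = np.array([1, 2, INT32_MISSING, 4], dtype=np.int32)
    print(sum_skipping_missing_int32(a))  # 7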