All,

I've been fairly quiet on the various missing/masked values proposals, sorry about that. I figured I would take the opportunity of Travis O.'s deadline of Wed. 16th to put some aspects of numpy.ma in a semi-historical, semi-personal perspective. Apologies in advance if I misrepresent some aspects; I count on our archivists to set the record straight when needed. Let it also be known that I haven't had any time to play with the new NA implementation...
Once upon a time...
-------------------

numpy.ma was developed on top of Paul Dubois' implementation for Numeric and its adaptation to Numarray. Originally, a masked array was its own object: the combination of a regular ndarray (the `.data`) and its `.mask` counterpart, a boolean ndarray with the same shape as the `.data`. Two special values were also defined: `nomask`, a flag (in practice equal to False) indicating that no item was masked in the `.data` array, and `masked`, representing an item being missing or ignored in the `.data`. The `.mask` had `True` items where the corresponding item in the `.data` was missing or ignored (i.e., was `masked`) and `False` items otherwise. Nothing surprising here.

I started to use numpy to analyze hydroclimatic datasets that are often incomplete: for some reason or another, data wasn't recorded for some period of time, and I wanted to take this aspect into account. Masked-arrays-as-objects met my needs until I tried to create a subclass... After unsuccessful attempts, doubtlessly due to my inexperience, I came to the realization that most of the issues I was experiencing could perhaps be solved by making the MaskedArray class a subclass of ndarray.

The new implementation followed the original one as much as possible, for compatibility of course, but also because I felt that there must have been some pretty good reasons for why things were coded the way they were, even if I couldn't exactly see why at the time. For example, the `nomask` flag, the `fill_value` concept, and the shared-mask approach were kept, as well as the MaskedUnary/MaskedBinary/DomainBinaryOperation layers.

In that regard, actually, there have been some modifications. Initially, the `.mask` of the output was computed first (from the `.mask` of the input(s) and a validity domain, if any), and the non-masked part was filled accordingly.
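The structure described above is still visible in today's numpy.ma; a minimal illustration using the modern `numpy.ma` namespace:

```python
import numpy as np
import numpy.ma as ma

# A masked array pairs a regular ndarray (.data) with a boolean .mask
a = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.data)               # the underlying ndarray: [1. 2. 3.]
print(a.mask)               # [False  True False]
print(a[1] is ma.masked)    # True: item 1 is the special `masked` value

# With no masked items, .mask is the `nomask` flag (in practice, False)
b = ma.masked_array([1.0, 2.0])
print(b.mask is ma.nomask)  # True
```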
This approach worked well in most cases but turned out to be problematic in some domained operations: finding the domain of np.pow in particular required too many tests. Therefore, the most efficient (well, least inefficient) approach was to compute the output `.data` first, and then its mask from the input masks and the invalid results. An ugly hack, yes, but one that was deemed necessary with a pure Python implementation. Paul Dubois had confided during an email exchange that, if things had been different, he would have coded MaskedArrays in C. I would have done the same if I could speak C (I'm still slowly learning it). I even considered Cython for a while, but never managed to subclass a ndarray (things may have changed now, it's been a while since I last tried).

<tl;dr> I think that most of the current issues with numpy.ma come from the fact that it's pure Python. The API itself hasn't really changed since the very beginning of Numeric, even if it has been improved (e.g., to support structured types) and the original masked-array-as-an-object concept has been replaced by a subclass of ndarray. IMHO, most of the overhead would be compensated by a lower-level implementation. In any case, numpy.ma is currently pure Python and therefore suboptimal. For this reason, I always considered it mostly a convenience module for simple problems; for more elaborate ones, I advised manipulating the `.data` and `.mask` (actually, the `._data` and `._mask`...) independently, at least till we come up with a more efficient solution.

I'm not saying that the current concepts behind numpy.ma are perfect, far from it. There's no distinction between np.NA and np.IGNORED, for example, and numpy.ma automatically masks invalid data (like INF or NAN), which may not always be what the user wants.

numpy.ma unmaintained?
----------------------

Chuck claimed [http://article.gmane.org/gmane.comp.python.numeric.general/49815] that numpy.ma is unmaintained.
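The compute-the-data-first approach and the automatic masking of invalid results mentioned earlier can both be observed in the current numpy.ma; a small sketch, assuming the modern `numpy.ma` namespace:

```python
import numpy as np
import numpy.ma as ma

# Domained operation: the data is computed first, then results falling
# outside the domain (here, log of a non-positive number) are masked.
x = ma.array([-1.0, 1.0, np.e])
y = ma.log(x)
print(y.mask)   # [ True False False]: the invalid result was masked
print(y[2])     # 1.0

# The same automatic masking applies to already-invalid input data:
z = ma.masked_invalid([1.0, np.inf, np.nan])
print(z.mask)   # [False  True  True]
```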
While I must recognize I haven't been able to work on the bugs for the past 18 months, I'm hoping that my soon-to-be future employer will leave me a bit of time to work on open-source projects (anyway, if this summer's weather is like this spring's, I won't be spending a lot of time in a park). Nevertheless, the tickets I've seen so far don't look that scary and could be fixed rather easily by anybody. The ticket about the printing of masked arrays with a given precision is trickier, though (the current implementation relies on a trick to avoid having to redo everything).

MaskedArray by default? Yesss!
------------------------------

So far, I'm 100% with Chuck when he recently [http://article.gmane.org/gmane.comp.python.numeric.general/49489] suggested transforming all ndarrays into masked arrays. That was something I had half-jokingly suggested at the time, because it would have made things so much easier for me. Of course, most users never have to deal with missing or ignored values, and forcing a ndarray to carry an extra mask is counter-productive. Nevertheless, it shouldn't be a problem, as Chuck pointed out. numpy.ma already has the `nomask` flag that is used to check whether there are no missing/ignored data in the array (a nifty trick Paul Dubois introduced to save some time):

>>> if marray.mask is nomask:
...     # no missing values: proceed as with a basic ndarray
... else:
...     # some missing values: proceed as needed

We could use the same scheme in C, with a flag indicating whether a mask is defined. If the mask is not set, we could use the standard ufuncs. If it is, we would use ufuncs modified to take a mask into account, or raise some exception if they are not yet defined or could not be defined (e.g., FFT). Introducing the flag would no doubt break some things downstream, but it still looks like the easiest approach.
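A runnable version of that dispatch idea, sketched in Python rather than C (the `fast_sum` name and the helper itself are mine for illustration, not an existing API):

```python
import numpy as np
import numpy.ma as ma

def fast_sum(arr):
    """Hypothetical dispatch: use the plain ufunc machinery when no mask
    is set, and fall back to the mask-aware path otherwise."""
    if getattr(arr, 'mask', ma.nomask) is ma.nomask:
        # fast path: behave exactly like a basic ndarray
        return np.add.reduce(np.asarray(arr))
    # slow path: let numpy.ma skip the missing/ignored items
    return arr.sum()

plain = np.array([1.0, 2.0, 3.0])
masked = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(fast_sum(plain))    # 6.0
print(fast_sum(masked))   # 4.0
```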
np.NA vs np.IGNORED / bitpattern vs mask
----------------------------------------

I think everybody recognizes that np.NA and np.IGNORED are different yet complementary concepts. A recent post [http://article.gmane.org/gmane.comp.python.numeric.general/49347] illustrated a use of missing data for concentrations below a given limit. Now imagine a series of concentrations recorded over a period of time where (i) the sensor failed to work (no point) or (ii) the sensor worked properly but only traces were recorded (a point below the limit). The first case (i) would be np.NA, the second np.IGNORED. Should you perform some time-series statistics, you may want to be able to distinguish between the two.

I hope we all agree that np.NA is destructive while np.IGNORED is not. I would expect to be able to assign values as such:

a[...] = np.NA      <=>  a[...].data = np.NA
a[...] = np.IGNORED <=>  a[...].mask = True   (numpy.ma convention)

That means that I could still 'ignore' a 'missing' value by setting the corresponding item in the `.mask` to the appropriate value (True w/ the numpy.ma convention, False in Mark's...). Finding where the np.NA are in an array could be done through a np.isna function (analogous to np.isnan). What about the ignored values? Should we just take the `.mask`? Do we need a np.isignored?

Computationally, both np.NA and np.IGNORED could be lumped into a single mask. We would then have to decide how to use this combined mask, which brings up the subject of propagation. I disagree with Lluis' proposal [http://article.gmane.org/gmane.comp.python.numeric.general/46728] to consider destructiveness and propagation as orthogonal properties. I agree with the following, though:

np.NA      : propagating
np.IGNORED : non-propagating

If a user wants a non-destructive but propagating flag, it'd be up to her (give me an example where it could be useful).
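Since np.NA and np.IGNORED only exist in the proposals, here is a hedged sketch of the destructive/non-destructive and propagating/non-propagating distinctions with today's tools, using NaN as a stand-in for a destructive bitpattern NA and the numpy.ma mask for IGNORED:

```python
import numpy as np
import numpy.ma as ma

a = ma.masked_array([1.0, 2.0, 3.0, 4.0])

# np.IGNORED analogue: non-destructive, only the mask changes
a[1] = ma.masked
print(a.data[1])            # 2.0: the value is still there under the mask
a.mask[1] = False           # ...so un-ignoring recovers it
print(a[1])                 # 2.0

# np.NA analogue (NaN standing in for the bitpattern): destructive
a.data[2] = np.nan
print(np.isnan(a.data[2]))  # True: the original value is gone for good

# Propagation: NaN (NA-like) propagates through a reduction,
# while a masked (IGNORED-like) item is simply skipped.
print(np.array([1.0, np.nan, 3.0]).sum())                             # nan
print(ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False]).sum())  # 4.0
```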
On a value basis, I agree with:

operation(np.NA, x)      -> np.NA       [e.g., np.NA + x = np.NA]
operation(np.IGNORED, x) -> np.IGNORED  [e.g., np.IGNORED + x = np.IGNORED]

And now?
--------

I think a first step would be to agree on what the behavior should be, which would define a set of unit tests. Could we consider a wiki page somewhere that would include polls, so that each of us could vote and/or discuss specifics? Then, of course, there's the matter of implementation. I won't have a lot of time to propose actual code for the next few weeks (couple of months), so I'm not expecting my voice to weigh more than others'... I'd be glad to participate in civil discussions or to answer emails off-list on numpy.ma-related aspects, and hopefully I'll be more active by the end of the summer / beginning of the fall.

Cordially,
P.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion