Re: [Numpy-discussion] new MaskedArray class

Marten van Kerkwijk Mon, 24 Jun 2019 17:35:36 -0700

On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <sho...@gmail.com> wrote:

> On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhald...@gmail.com>
> wrote:
>
>> I'm not at all set on that behavior and we can do something else. For
>> now, I chose this way since it seemed to best match the "IGNORE" mask
>> behavior.
>>
>> The behavior you described further above where the output row/col would
>> be masked corresponds better to "NA" (propagating) mask behavior, which
>> I am leaving for later implementation.
>
>
> This does seem like a clean way to *implement* things, but from a user
> perspective I'm not sure I would want separate classes for "IGNORE" vs "NA"
> masks.
>
> I tend to think of "IGNORE" vs "NA" as descriptions of particular
> operations rather than the data itself. There is a spectrum of ways to
> handle missing data, and the right way to propagate missing values is
> often highly context dependent. The right way to set this is in functions
> where operations are defined, not on classes that may be defined far away
> from where the computations happen. For example, pandas has a "min_count"
> parameter in functions for intermediate use-cases between "IGNORE" and "NA"
> semantics, e.g., "take an average, unless the number of data points is
> fewer than min_count."
>

Anything as specific as that is probably indeed outside of the purview
of a MaskedArray class.
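
For reference, the intermediate "min_count" behaviour quoted above could be
sketched roughly as follows. This is only an illustration: `masked_mean` is a
hypothetical helper, not part of NumPy, pandas, or the proposed MaskedArray,
and it assumes the mask is True where an element should be skipped.

```python
import numpy as np

def masked_mean(data, mask, min_count=1):
    """Mean over unmasked elements; NaN if fewer than min_count are valid.

    Hypothetical helper for illustration only.
    mask is True where an element is masked (to be skipped).
    """
    data = np.asarray(data, dtype=float)
    valid = ~np.asarray(mask, dtype=bool)
    n_valid = int(valid.sum())
    # Too few valid points (or none at all): report "missing".
    if n_valid < min_count or n_valid == 0:
        return np.nan
    return data[valid].mean()

data = [1.0, 2.0, 3.0, 4.0]
mask = [False, True, True, False]

# min_count=1 behaves like "IGNORE": average whatever is left.
ignore_like = masked_mean(data, mask, min_count=1)
# min_count=len(data) behaves like "NA": any masked element propagates.
na_like = masked_mean(data, mask, min_count=len(data))
```

Values of min_count between those two extremes give the intermediate
behaviour described in the quoted text.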

But your general point is well taken: we really need to ask clearly what
the mask means not in terms of operations but conceptually.

Personally, I guess like Benjamin I have mostly thought of it as "data here
is bad" (because corrupted, etc.) or "data here is irrelevant" (because of
sea instead of land in a map). And I would like to proceed nevertheless
with calculating things on the remainder. For an expectation value (or,
less obviously, a minimum or maximum), this is mostly OK: just ignore the
masked elements. But even for something as simple as a sum, what is correct
is not obvious: if I ignore the count, I'm effectively assuming the
expectation is symmetric around zero (this is why `vector.dot(vector)`
fails); a better estimate would be `np.add.reduce(data, where=~mask) *
N(total) / N(unmasked)`.
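
A minimal sketch of that rescaled sum, with made-up data and assuming the mask
is True for masked elements (and a NumPy recent enough to support `where=` in
reductions). The rescaling itself assumes the masked values have the same mean
as the unmasked ones, which is exactly the kind of guidance a user would have
to supply:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])
mask = np.array([False, False, True, False])  # True = masked

# Sum over the unmasked elements only; this ignores the count.
partial = np.add.reduce(data, where=~mask)

# Rescale by N(total) / N(unmasked) to estimate the full sum.
n_total = data.size
n_unmasked = np.count_nonzero(~mask)
rescaled = partial * n_total / n_unmasked
```

Here `partial` only adds up the three unmasked values, while `rescaled`
fills in the masked slot with the mean of the others.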

Of course, the logical conclusion would be that this is not possible to do
without guidance from the user, or, thinking more, that really a masked
array is not at all what I want for this problem; really I am just using
(1-mask) as a weight, and the sum of the weights matters, so I should have
a WeightArray class where that is returned along with the sum of the data
(or, a bit less extreme, a `CountArray` class, or, more extreme, a value
and its uncertainty - heck, sounds a lot like my Variable class from 4
years ago, https://github.com/astropy/astropy/pull/3715, which even takes
care of covariance [following the `uncertainties` package]).
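
To make the weight idea concrete, here is a rough sketch in which a reduction
returns the sum of the weights together with the sum of the data. The function
name `weighted_sum` is hypothetical and not an existing or proposed NumPy API;
it assumes a boolean mask that is True for masked elements.

```python
import numpy as np

def weighted_sum(data, mask, axis=None):
    """Return (sum of unmasked data, sum of weights) along axis.

    Hypothetical helper: the weight is (1 - mask), i.e. 1 for valid
    elements and 0 for masked ones.
    """
    data = np.asarray(data, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    weights = 1.0 - mask.astype(float)
    # Zero out masked entries explicitly so NaN or inf there cannot leak in.
    total = np.sum(np.where(mask, 0.0, data), axis=axis)
    weight_sum = np.sum(weights, axis=axis)
    return total, weight_sum

data = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [False, False]])
total, weight_sum = weighted_sum(data, mask, axis=1)
```

With both numbers in hand, the caller can decide whether to use the plain sum,
rescale it, or divide to get a mean.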

OK, it seems I've definitely worked myself into a corner tonight where I'm
not sure any more what a masked array is good for in the first place...
I'll stop for the day!

All the best,

Marten

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
