Re: [Numpy-discussion] new MaskedArray class

Allan Haldane Tue, 18 Jun 2019 09:55:10 -0700

On 6/18/19 10:06 AM, Marten van Kerkwijk wrote:
> Hi Allen,
> 
> Thanks for the message and link! In astropy, we've been struggling with
> masking a lot, and one of the main conclusions I have reached is that
> ideally one has a more abstract `Masked` class that can take any type of
> data (including `ndarray`, of course), and behaves like that data as
> much as possible, to the extent that if, e.g., I create a
> `Masked(Quantity(..., unit), mask=...)`, the instance will have a
> `.unit` attribute and perhaps even `isinstance(..., Quantity)` will
> hold. And similarly for `Masked(Time(...), mask=...)`,
> `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a
> kind of Mixin-class that just tracks a mask attribute.
> 
> This may be too much to ask from the initializer, but, if so, it still
> seems most useful if it is made as easy as possible to do, say, `class
> MaskedQuantity(Masked, Quantity): <very few overrides>`.


Currently MaskedArray does not accept ducktypes as underlying arrays,
but I think it shouldn't be too hard to modify it to do so. Good idea!

I already partly navigated this mixin-issue in the
"MaskedArrayCollection" class, which essentially does
ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
boilerplate. That's the backwards encapsulation order from what you want
though.

> Even if this impossible, I think it is conceptually useful to think
> about what the masking class should do. My sense is that, e.g., it
> should not attempt to decide when an operation succeeds or not, but just
> "or together" input masks for regular, multiple-input functions, and let
> the underlying arrays skip elements for reductions by using `where`
> (hey, I did implement that for a reason... ;-). In particular, it
> suggests one should not have things like domains and all that (I never
> understood why `MaskedArray` did that). If one wants more, the class
> should provide a method that updates the mask (a sensible default might
> be `mask |= ~np.isfinite(result)` - here, the class being masked should
> logically support ufuncs and functions, so it can decide what "isfinite"
> means).

I agree it would be nice to remove domains. It would make life easier,
and I could remove a lot of twiddly code! I kept it in for now to
minimize the behavior changes from the old MaskedArray.

> In any case, I would think that a basic truth should be that everything
> has a mask with a shape consistent with the data, so
> 1. Each complex numbers has just one mask, and setting `a.imag` with a
> masked array should definitely propagate the mask.
> 2. For a masked array with structured dtype, I'd similarly say that the
> default is for a mask to have the same shape as the array. But that
> something like your collection makes sense for the case where one wants
> to mask items in a structure.

Agreed that we should have a single bool per complex or structured
element, and the mask shape is the same as the array shape. That's how I
implemented it. But there is still a problem with complex.imag assignment:

    >>> a = MaskedArray([1j, 2, X])
    >>> i = a.imag
    >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what
should the real part of a[2] be? Conversely, if we don't copy the mask,
what should the imag part of a[1] be? It seems like we might "want" the
masks to be OR'd instead, but then should i[2] be masked after we just
set it to 1?


> All the best,
> 
> Marten
> 
> p.s. I started trying to implement the above "Mixin" class; will try to
> clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released.
To avoid duplication of effort, I've attached a diff of what I tried. I
actually get a slight slowdown of about 10% by using where...

If you make progress with the mixin, a push is welcome. I imagine a
problem is going to be that np.isscalar doesn't work to detect duck scalars.

Cheers,
Allan




> On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Hi all,
> 
>     Chuck suggested we think about a MaskedArray replacement for 1.18.
> 
>     A few months ago I did some work on a MaskedArray replacement using
>     `__array_function__`, which I got mostly working. It seems like a good
>     time to bring it up for discussion now. See it at:
> 
>     https://github.com/ahaldane/ndarray_ducktypes
> 
>     It should be very usable, it has docs you can read, and it passes a
>     pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
>     What is missing? It needs even more tests for new functionality, and a
>     couple numpy-API functions are missing, in particular `np.median`,
>     `np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
>     also find many things to improve too.
> 
>     Besides fixing many little annoyances from MaskedArray, and simplifying
>     the logic by always storing the mask in full, it also has new features.
>     For instance it allows the use of a "X" variable to mark masked
>     locations during array construction, and I solve the issue of how to
>     mask individual fields of a structured array differently.
> 
>     At this point I would by happy to get some feedback on the design and
>     what seems good or bad. If it seems like a good start, I'd be happy to
>     move it into a numpy repo of some sort for further collaboration &
>     discussion, and maybe into 1.18. At the least I hope it can serve as a
>     design study of what we could do.
> 
> 
> 
> 
> 
>     Let me also drop here two more interesting detailed issues:
> 
>     First, the issue of what to do about .real and .imag of complex arrays,
>     and similarly about field-assignment of structured arrays. The problem
>     is that we have a single mask bool per element of a complex array, but
>     what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
>     mask of the original array change? Should we make .real and .imag
>     readonly?
> 
>     Second, a more general issue of how to ducktype scalars when using
>     `__array_function__` which I think many ducktype implementors will have
>     to face. For MaskedArray, I created an associated "MaskedScalar" type.
>     However, MaskedScalar has to behave differently from normal numpy
>     scalars in a number of ways: It is not part of the numpy scalar
>     hierarchy, it fails checks `isinstance(var, np.floating)`, and
>     np.isscalar returns false. Numpy scalar types cannot be subclassed. We
>     have discussed before the need to have distinction between 0d-arrays and
>     scalars, so we shouldn't just use a 0d (in fact, this makes printing
>     very difficult). This leads me to think that in future dtype-overhaul
>     plans, we should consider creating a subclassable `np.scalar` base type
>     to wrap all numpy scalar variables, and code like `isinstance(var,
>     np.floating)` should be replaced by `isinstance(var.dtype.type,
>     np.floating)` or similar. That is, the numeric dtype of the scalar is no
>     longer encoded in `type(var)` but in `var.dtype`: The fact that the
>     variable is a numpy scalar is decoupled from its numeric dtype.
> 
>     This is useful because there are many "associated" properties of scalars
>     in common with arrays which have nothing to do with the dtype, which
>     ducktype implementors want to touch. I imagine this will come up a lot:
>     In that repo I also have an "ArrayCollection" ducktype which required a
>     "CollectionScalar" scalar, and similarly I imagine people implementing
>     units want the units attached to the scalar, independently of the dtype.
> 
>     Cheers,
>     Allan
> 
> 
> 
> 
>     _______________________________________________
>     NumPy-Discussion mailing list
>     [email protected] <mailto:[email protected]>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

diff --git a/MaskedArray.py b/MaskedArray.py
index 2df12b4..272699f 100755
--- a/MaskedArray.py
+++ b/MaskedArray.py
@@ -904,12 +904,23 @@ class _Masked_BinOp(_Masked_UFunc):
         if isinstance(initial, (MaskedScalar, MaskedX)):
             raise ValueError("initial should not be masked")
 
-        if not np.isscalar(da):
-            da[ma] = self.reduce_fill(da.dtype)
-            # if da is a scalar, we get correct result no matter fill
+        if 1: # two different implementations, investigate performance
+            wheremask = ~ma
+            if 'where' in kwargs:
+                wheremask |= kwargs['where']
+            kwargs['where'] = wheremask
+            if 'initial' not in kwargs:
+                kwargs['initial'] = self.reduce_fill(da.dtype)
+
+            result = self.f.reduce(da, **kwargs)
+            m = np.logical_and.reduce(ma, **mkwargs)
+        else:
+            if not np.isscalar(da):
+                da[ma] = self.reduce_fill(da.dtype)
+                # if da is a scalar, we get correct result no matter fill
 
-        result = self.f.reduce(da, **kwargs)
-        m = np.logical_and.reduce(ma, **mkwargs)
+            result = self.f.reduce(da, **kwargs)
+            m = np.logical_and.reduce(ma, **mkwargs)
 
         ## Code that might be used to support domained ufuncs. WIP
         #with np.errstate(divide='ignore', invalid='ignore'):

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] new MaskedArray class

Reply via email to