On Wed, Oct 22, 2025 at 10:48 PM Marten van Kerkwijk via NumPy-Discussion <
[email protected]> wrote:

> Hi Carlos,
>
> Indeed, the implementation for making NaN mean "omit" for some functions
> is not too difficult now.  Your example actually leads to the opposite
> conclusion about ease, though, as you really should count also the
> implementation of ``where`` (I added it... with one of the planned goals
> being its use in the nanfunctions...).  This makes it simple now, but under the
> hood the reductions have to take a different path if where is present
> (see code in `umath/reduction.c`).  So overall supporting NaN as missing
> is actually not simple even for ``sum``, and I am fairly certain the
> same will hold generally.
>
> Now I can see why you would dislike creating a new class, as it adds
> complexity.  But in the end simplicity matters for functions too: if I
> write code to deal with an array of floats, it is far simpler if I
> can treat the elements as standard floats, with NaN keeping its standard
> meaning of Not a Number.
>
> It also keeps maintenance of those functions simpler, with fewer tests
> for fewer combinations of arguments, and helps standardization between
> different array types: If we were to go your route, *every* array
> implementation has to start supporting treating NaN in different ways.
> (And why stop there?  IIRC, pandas uses the most negative int to signal
> a masked value; should we start supporting that too?)
>
> Now another way of thinking is that the array should be the same, but it
> needs to be explicit about how its data is interpreted, i.e., signal
> that it wants NaN treated as missing.  That does not necessarily require
> a new array class, but may be possible by creating a new data type,
> which wraps a regular float.  Conceptually, though, that requires
> creating new float loops for every ufunc for which this may matter, so
> again not simple.
>
> Finally, I note that in the data api issue you quote:
>
> > It is better to have 100 functions operate on one data structure than 10
> > functions on 10 data structures.
>
> But the obvious answer to that is that, in fact, numpy does exactly that
> by providing the nanfunctions.  There is nothing stopping you from using
> those functions all the time, even when arrays may not have `NaN`.
> Indeed, in a way my suggested new NanMask Array API compatible class
> would just bundle those nanfunctions in a more convenient package...
>
> Anyway, in the end I think all approaches will end up essentially costing
> the same amount of effort, and I think for a relatively niche case of
> using NaN as masks, one should pick one that does not require changes to
> the base numpy implementations.
>

The way I read your argument about overall implementation complexity -
which I agree is fairly high - is that we shouldn't force that on other
array libraries through the array API standard, since it'd be a lot of work
for them.

For NumPy itself, we already have the existing implementations as well as
the where= machinery, so it's more of an API design question. A keyword
costs a lot less than a new function, API surface wise. I think that's the
only real difference, given that it can call essentially the same
implementation under the hood. We may have to do that in C rather than in
Python, which is a slight complication - but also one with less overhead
than using the function.
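
To make that concrete, here is a minimal sketch of what such a keyword could
dispatch to, written in Python rather than C. The wrapper name
`sum_with_nan_keyword` and the keyword `ignore_nan` are assumptions for
illustration only; `np.sum` has no such keyword today. It just folds the NaN
mask into the existing `where=` argument, as Carlos describes further down.

```python
import numpy as np


def sum_with_nan_keyword(a, axis=None, where=True, ignore_nan=False):
    """Hypothetical stand-in for np.sum with an `ignore_nan` keyword."""
    a = np.asarray(a)
    # Only floating-point and complex dtypes can hold NaN.
    if ignore_nan and np.issubdtype(a.dtype, np.inexact):
        # Combine the NaN mask with any mask the caller supplied.
        where = ~np.isnan(a) & where
    # Same reduction path as a plain masked sum.
    return np.sum(a, axis=axis, where=where)
```

With `ignore_nan=False` this is just `np.sum`, so the regular case takes no
extra path; the mask is only built when the keyword is set.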

I don't think the array container suggestion is very relevant, for the
reasons Carlos gave as well as others: MaskedArray is quite buggy, mostly
unmaintained, and non-recommended by us; we're quite unlikely to add
another new array container inside NumPy at this point; and
https://github.com/mdhaber/marray is in much better shape but a separate
library - fine for end users, but packages like scikit-learn that just need
some nan-handling are not going to add a dependency on that (they may
possibly vendor it, but that's also expensive).

So I think the relevant choices are:
1. Keep the current status quo unchanged (and possibly direct end users
who need more than what we offer now to `marray`)
2. Add a keyword to reductions
3. Add a single factory function that turns regular reductions into
nan-aware ones (as in
https://github.com/data-apis/array-api/issues/621#issuecomment-1553481118)
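
For (3), a rough sketch of what a factory could look like, assuming the
wrapped reduction accepts `where=`. The name `nan_aware` is hypothetical and
this is not necessarily the API proposed in the linked comment:

```python
import numpy as np


def nan_aware(reduction):
    """Hypothetical factory: return a NaN-omitting version of `reduction`.

    Assumes `reduction` accepts a `where=` keyword, as np.sum, np.prod,
    np.mean, np.max and np.min do.
    """
    def wrapper(a, *args, where=True, **kwargs):
        a = np.asarray(a)
        # Mask out NaNs, combined with any mask the caller supplied.
        return reduction(a, *args, where=~np.isnan(a) & where, **kwargs)
    return wrapper


nan_sum = nan_aware(np.sum)
nan_mean = nan_aware(np.mean)
nan_max = nan_aware(np.max)  # callers must pass initial=, see note
```

Reductions without an identity, such as `np.max` and `np.min`, require
`initial=` whenever `where=` is passed, so the wrapped versions inherit that
requirement.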

I think (1) is also a very reasonable outcome if we don't like any of the
alternatives.

Cheers,
Ralf



> "Carlos Martin" <[email protected]> writes:
>
> >> The costs I worry about are performance and increased maintenance
> >> burden for the regular, no-nan case.  For instance, the "obvious" way to
> >> implement a nan-omitting sum would be to check inside a loop whether any
> >> given element was nan, thus slowing down the regular case (e.g., by
> >> breaking vectorization).  To avoid this one has to be careful, thus making
> >> code harder to write, more fragile, and more difficult to maintain
> >> (analogous to -- but worse than -- tracking floating point errors).
> >
> > I'm not sure I understand your objection here. Consider the way `nansum`
> > is currently implemented:
> > https://github.com/numpy/numpy/blob/76e91189b23d4e0afc34130e95f4f460a3d57d95/numpy/lib/_nanfunctions_impl.py#L725
> >
> >> a, mask = _replace_nan(a, 0)
> >> return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims,
> >>               initial=initial, where=where)
> >
> > The `ignore_nan` version would simply do the same thing, but inside the
> > body of `numpy.sum`. Or it can call `np.sum` with `where=~np.isnan(a) if
> > where is None else ~np.isnan(a) & where` (i.e., combining with any mask the
> > user supplies).
> >
> > I object to the approach of complicating the array ontology, for the
> > reasons described here:
> > https://github.com/data-apis/array-api/issues/621#issuecomment-3433986363.
> > _______________________________________________
> > NumPy-Discussion mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> > https://mail.python.org/mailman3//lists/numpy-discussion.python.org
> > Member address: [email protected]
