On Wed, Oct 22, 2025 at 10:48 PM Marten van Kerkwijk via NumPy-Discussion < [email protected]> wrote:
> Hi Carlos,
>
> Indeed, the implementation for making NaN mean "omit" for some functions
> is not too difficult now. Your example actually leads to the opposite
> conclusion about ease, though, as you really should count also the
> implementation of ``where`` (I added it... with one of the planned goals
> the use in nanfunctions...). This makes it simple now, but under the
> hood the reductions have to take a different path if where is present
> (see code in `umath/reduction.c`). So overall supporting NaN as missing
> is actually not simple even for ``sum``, and I am fairly certain the
> same will hold generally.
>
> Now I can see why you would dislike creating a new class, as it adds
> complexity. But in the end simplicity holds for functions too: if I
> write code to deal with an array of floats, it is far more simple if I
> can treat the elements as standard floats, with the standard meaning of
> it as Not a Number.
>
> It also keeps maintenance of those functions simpler, with fewer tests
> for fewer combinations of arguments, and helps standardization between
> different array types: If we were to go your route, *every* array
> implementation has to start supporting treating NaN in different ways.
> (And why stop there? IIRC, pandas uses the most negative int to signal
> a masked value; should we start supporting that too?)
>
> Now another way of thinking is that the array should be the same, but it
> needs to be explicit about how its data is interpreted, i.e., signal
> that it wants NaN treated as missing. That does not necessarily require
> a new array class, but may be possible by creating a new data type,
> which wraps a regular float. Conceptually, though, that requires
> creating new float loops for every ufunc for which this may matter, so
> again not simple.
>
> Finally, I note that in the data api issue you quote:
>
> > It is better to have 100 functions operate on one data structure than 10
> > functions on 10 data structures.
>
> But the obvious answer to that is that, in fact, numpy does exactly that
> by providing the nanfunctions. There is nothing stopping you from using
> those functions all the time, even when arrays may not have `NaN`.
> Indeed, in a way my suggested new NanMask Array API compatible class
> would just bundle those nanfunctions in a more convenient package...
>
> Anyway, in the end I think all approaches will end up essentially costing
> the same amount of effort, and I think for a relatively niche case of
> using NaN as masks, one should pick one that does not require changes to
> the base numpy implementations.
>

The way I read your argument about overall implementation complexity -
which I agree is fairly high - is that we shouldn't force that on other
array libraries through the array API standard, since it'd be a lot of
work for them.

For NumPy itself, we already have the existing implementations as well as
the where= machinery, so it's more of an API design question. A keyword
costs a lot less than a new function, API surface wise. I think that's the
only real difference, given that it can call essentially the same
implementation under the hood. We may have to do that in C rather than in
Python perhaps, so a slight complication - but also one with less overhead
than using the function.

I don't think the array container suggestion is very relevant, for the
reasons Carlos gave as well as others: MaskedArray is quite buggy, mostly
unmaintained, and non-recommended by us; we're quite unlikely to add
another new array container inside NumPy at this point; and
https://github.com/mdhaber/marray is in much better shape but a separate
library - fine for end users, but packages like scikit-learn that just
need some nan-handling are not going to add a dependency on that (they may
possibly vendor it, but that's also expensive).

So I think the relevant choices are:

1. Change nothing about the current status quo (and possibly direct end
   users who need more than what we offer now to `marray`)
2. Add a keyword to reductions
3. Add a single factory function that turns regular reductions into
   nan-aware ones (as in
   https://github.com/data-apis/array-api/issues/621#issuecomment-1553481118)

I think (1) is also a very reasonable outcome if we don't like any of the
alternatives.

Cheers,
Ralf

> "Carlos Martin" <[email protected]> writes:
>
> >> The costs I worry about are performance and increased maintenance
> >> burden for the regular, no-nan case. For instance, the "obvious" way to
> >> implement a nan-omitting sum would be to check inside a loop whether any
> >> given element was nan, thus slowing down the regular case (e.g., by
> >> breaking vectorization). To avoid this one has to be careful, thus making
> >> code harder to write, more fragile, and more difficult to maintain
> >> (analogous to -- but worse than -- tracking floating point errors).
> >
> > I'm not sure I understand your objection here. Consider the way `nansum`
> > is currently implemented:
> > https://github.com/numpy/numpy/blob/76e91189b23d4e0afc34130e95f4f460a3d57d95/numpy/lib/_nanfunctions_impl.py#L725
> > .
> >
> >> a, mask = _replace_nan(a, 0)
> >> return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims,
> >> initial=initial, where=where)
> >
> > The `ignore_nan` version would simply do the same thing, but inside the
> > body of `numpy.sum`. Or it can call `np.sum` with `where=~np.isnan(a) if
> > where is None else ~np.isnan(a) & where` (i.e., combining with any mask the
> > user supplies).
> >
> > I object to the approach of complicating the array ontology, for the
> > reasons described here:
> > https://github.com/data-apis/array-api/issues/621#issuecomment-3433986363.
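
For concreteness, the `where=` combination Carlos describes, packaged in
the style of the factory function from option (3), might look roughly like
the sketch below. This is only an illustration, not a proposed API: the
names `nan_aware` and `nansum_sketch` are made up here, and the sketch
assumes the wrapped reduction accepts a `where=` keyword (as `np.sum` and
`np.mean` do) and that the input dtype is one `np.isnan` supports.

```python
import numpy as np


def nan_aware(reduction):
    """Hypothetical factory: wrap a reduction so that NaN entries are skipped.

    Assumes `reduction` accepts a `where=` keyword, like np.sum or np.mean.
    """
    def wrapped(a, *args, where=True, **kwargs):
        a = np.asarray(a)
        # Combine the NaN mask with any mask the caller supplies.
        keep = ~np.isnan(a) & where
        return reduction(a, *args, where=keep, **kwargs)
    return wrapped


# Illustrative name only; NumPy already ships np.nansum.
nansum_sketch = nan_aware(np.sum)

a = np.array([1.0, np.nan, 3.0])
total = nansum_sketch(a)                                      # skips the NaN
partial = nansum_sketch(a, where=np.array([True, True, False]))
```

Reductions without an identity (e.g. `np.max`) would additionally need an
`initial=` value when `where=` is passed, so a real factory would have to
handle those case by case. The sketch also does nothing about the C-level
path that Marten points out reductions take when `where` is present.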
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
