On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:
> I agree with the argument to revert. Consistency among libraries should
> be near the top of the list of priorities, as should performance.
> Whether the new behaviour "makes more sense", meanwhile, is debatable.
> 

I don't have much of an opinion either way.  But while correctness is
debatable, is there any reason someone would prefer getting multiple
(potentially very many) NaNs?  I admit it feels more like a trap to me.


About Performance
-----------------

Note that at least on CPUs there is no reason why there needs to be
any overhead at all [1]!  (We simply lack an `np.islessgreater`
function.)

Admittedly, the overhead is a bit annoying, although pretty typical
for NumPy, I guess.  It seems to be reducible to around 10% if you use
`a != a` instead of `isnan` (the `isnan` ufunc call incurs a lot of
overhead), which puts it roughly on par with 1.19.x, since NumPy got
faster in the meantime ;).
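
For illustration, here is a rough sketch of a `unique`-style pass that
uses the self-comparison trick; this is not the actual `np.unique`
implementation, and `unique_collapse_nans` is just a made-up name:

    import numpy as np

    def unique_collapse_nans(ar):
        """Sketch of the 1.21-style behaviour: sort, keep the first
        occurrence of each value, and keep at most a single NaN."""
        ar = np.sort(np.asarray(ar, dtype=float))  # NaNs sort to the end
        if ar.size == 0:
            return ar
        keep = np.empty(ar.shape, dtype=bool)
        keep[0] = True
        keep[1:] = ar[1:] != ar[:-1]
        # `ar != ar` is True exactly where ar is NaN (IEEE: NaN != NaN),
        # standing in for the more expensive np.isnan(ar) call.
        nan_mask = ar != ar
        if nan_mask.any():
            first_nan = np.argmax(nan_mask)  # index of the first NaN
            keep[first_nan] = True
            keep[first_nan + 1:] = False
        return ar[keep]

    >>> unique_collapse_nans(np.array([1, 1, np.nan, np.nan, np.nan]))
    array([ 1., nan])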


About Reverting
---------------

It feels like this should have a compat release note, which seems like
a very strong flag that this is "not just a bug fix".  But maybe there
is a good reason to break with that?


Cheers,

Sebastian



[1] C99 has `islessgreater`, and both SSE2 and AVX have intrinsics
which allow reformulating the whole thing to have exactly the same
speed as before; these intrinsics effectively calculate things like:

    !(a < b)   or   !(a != b)

Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a
problem that we already have and is even now not really avoidable).

So at least for all relevant CPUs, the remaining overhead is just an
implementation detail: we currently don't have `islessgreater`.  (I
have not found information on whether the C99 `islessgreater` has
decent performance on CUDA.)

(Not surprisingly, MSVC is arguably buggy and does not compile the
C99 `islessgreater` to fast machine code, although plausibly the
link-time optimizer may save that.  But since we have hand-coded
SSE2/AVX versions of the comparisons, even that would probably not
matter.)
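
To illustrate the semantics at the Python level (just a rough sketch;
`np.islessgreater` does not exist, so this spells it out with two
comparisons, which of course does not have the speed benefit):

    import numpy as np

    x = np.array([np.nan, np.nan, 1.0, 1.0])
    y = np.array([np.nan, 1.0,    1.0, 2.0])

    # IEEE "!=": unordered pairs (anything involving NaN) compare as
    # True, i.e. NaN counts as "different" from everything, even NaN.
    print(x != y)                    # [ True  True False  True]

    # islessgreater-style "(x < y) | (x > y)": unordered pairs compare
    # as False, i.e. NaN is not considered different from NaN.  As
    # noted above, the ordered comparisons may set the invalid FPE
    # flag for NaNs, hence the errstate guard.
    with np.errstate(invalid="ignore"):
        print((x < y) | (x > y))     # [False False False  True]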



> On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:
> > 
> > 
> > On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gomm...@gmail.com>
> > wrote:
> > > 
> > > 
> > > On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg
> > > <sebast...@sipsolutions.net> wrote:
> > > > Hi all,
> > > > 
> > > > In NumPy 1.21, the output of `np.unique` changed in the presence
> > > > of multiple NaNs.  Previously, all NaNs were returned (each NaN
> > > > was considered unique), whereas we now only return one:
> > > > 
> > > >     a = np.array([1, 1, np.nan, np.nan, np.nan])
> > > > 
> > > > Before 1.21:
> > > > 
> > > >     >>> np.unique(a)
> > > >     array([ 1., nan, nan, nan])
> > > > 
> > > > After 1.21:
> > > > 
> > > >     array([ 1., nan])
> > > > 
> > > > 
> > > > This change was requested in an old issue:
> > > > 
> > > >      https://github.com/numpy/numpy/issues/2111
> > > > 
> > > > And happened here:
> > > > 
> > > >      https://github.com/numpy/numpy/pull/18070
> > > > 
> > > > While it has a release note, I am not sure the change got the
> > > > attention it deserved.  This would be especially worrying if it
> > > > is a regression for anyone?
> > > 
> > > I think it's now the expected answer, not a regression. `unique` is
> > > not an elementwise function that needs to adhere to IEEE-754 where
> > > nan != nan. I can't remember reviewing this change, but it makes
> > > perfect sense to me.
> > 
> > It turns out there's a lot more to this story. The short summary of
> > it is that:
> > - use case wise it's still true that considering NaN's equal makes
> > more sense.
> > - scikit-learn has a custom implementation which removes duplicate
> > NaN's:
> > https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7
> > - all other array/tensor libraries match NumPy's old behavior (return
> > multiple NaN's)
> > - there's a performance and implementation cost for CuPy et al. to
> > match NumPy's changed behavior (multiple NaN's is the natural result,
> > no extra checks needed)
> > - There is also a significant performance cost for NumPy, ~25% for
> > small/medium-size arrays with no/few NaN's (see the benchmarks in
> > https://github.com/numpy/numpy/pull/18070) - which is a common case,
> > and that's not negligible like the PR description claims.
> > - the "single NaN" behavior is easy to get as a utility function on
> > top of the "multiple NaN' (like scikit-learn does), the opposite is
> > much harder
> > - for the array API standard, the final decision was to go with the
> > "multiple NaN" behavior, so we'll need that in `numpy.array_api`.
> > - more discussion in
> > https://github.com/data-apis/array-api/issues/249
> > 
> > It would be good to make up our minds before 1.22.0. Two choices:
> > 1. We can leave it as it is now and work around the issue in
> > `numpy.array_api`. It would also require CuPy and others to make
> > changes, which probably cost performance.
> > 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a
> > release will be made.
> > 
> > There is something to say for either option. I'd personally lean
> > slightly towards reverting because of both consistency with other
> > implementations and performance. But it seems like there's no optimal
> > solution here, it's a fairly hairy issue.
> > 
> > Thoughts?
> > 
> > Ralf
> > 
