I agree with the argument to revert. Consistency among libraries should be near 
the top of the list of priorities, as should performance. Whether the new 
behaviour "makes more sense", meanwhile, is debatable.
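For what it's worth, the "single NaN" result really is cheap to layer on top of the "multiple NaN" result, much as the scikit-learn helper Ralf links does. A minimal sketch (the function name `unique_single_nan` is my own, not NumPy or scikit-learn API):

```python
import numpy as np

def unique_single_nan(a):
    """Return np.unique(a), with any run of NaNs collapsed to one NaN.

    Works whether the underlying np.unique returns a single NaN
    (NumPy >= 1.21) or several (older NumPy and most other array
    libraries), so it is safe across both behaviours.
    """
    u = np.unique(a)
    nan_mask = np.isnan(u)
    if nan_mask.any():
        # np.unique sorts its output, so all NaNs sit contiguously
        # at the end; drop them and append exactly one NaN.
        u = np.concatenate([u[~nan_mask], [np.nan]])
    return u
```

Going the other way (recovering the "multiple NaN" output from a deduplicated one) would require re-counting NaNs in the input, which is why the revert direction is the easy one.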

On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:
> 
> 
> On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>> 
>> 
>> On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg <sebast...@sipsolutions.net> 
>> wrote:
>>> Hi all,
>>> 
>>> In NumPy 1.21, the output of `np.unique` changed in the presence of
>>> multiple NaNs.  Previously, all NaNs were returned when we now only
>>> return one (all NaNs were considered unique):
>>> 
>>>     a = np.array([1, 1, np.nan, np.nan, np.nan])
>>> 
>>> Before 1.21:
>>> 
>>>     >>> np.unique(a)
>>>     array([ 1., nan, nan, nan])
>>> 
>>> After 1.21:
>>> 
>>>     array([ 1., nan])
>>> 
>>> 
>>> This change was requested in an old issue:
>>> 
>>>      https://github.com/numpy/numpy/issues/2111
>>> 
>>> And happened here:
>>> 
>>>      https://github.com/numpy/numpy/pull/18070
>>> 
>>> While it has a release note, I am not sure the change got the
>>> attention it deserved. This would be especially worrying if it is a
>>> regression for anyone.
>> 
>> I think it's now the expected answer, not a regression. `unique` is not an 
>> elementwise function that needs to adhere to IEEE-754 where nan != nan. I 
>> can't remember reviewing this change, but it makes perfect sense to me.
> 
> It turns out there's a lot more to this story. The short summary of it is 
> that:
> - use-case-wise, it's still true that considering NaNs equal makes more sense.
> - scikit-learn has a custom implementation which removes duplicate NaNs: 
> https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7
> - all other array/tensor libraries match NumPy's old behavior (returning 
> multiple NaNs)
> - there's a performance and implementation cost for CuPy et al. to match 
> NumPy's changed behavior (multiple NaNs are the natural result; no extra 
> checks needed)
> - there is also a significant performance cost for NumPy, ~25% for 
> small/medium-size arrays with no/few NaNs (see the benchmarks in 
> https://github.com/numpy/numpy/pull/18070) - which is a common case, and 
> that's not negligible, contrary to what the PR description claims.
> - the "single NaN" behavior is easy to get as a utility function on top of 
> the "multiple NaN" behavior (as scikit-learn does); the opposite is much harder
> - for the array API standard, the final decision was to go with the "multiple 
> NaN" behavior, so we'll need that in `numpy.array_api`.
> - more discussion in https://github.com/data-apis/array-api/issues/249
> 
> It would be good to make up our minds before 1.22.0. Two choices:
> 1. We can leave it as it is now and work around the issue in 
> `numpy.array_api`. It would also require CuPy and others to make changes, 
> which would likely cost performance.
> 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a release will 
> be made.
> 
> There is something to say for either option. I'd personally lean slightly 
> towards reverting because of both consistency with other implementations and 
> performance. But it seems like there's no optimal solution here, it's a 
> fairly hairy issue.
> 
> Thoughts?
> 
> Ralf
> 
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: j...@fastmail.com
> 