On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote: > I agree with the argument to revert. Consistency among libraries should > be near the top of the list of priorities, as should performance. > Whether the new behaviour "makes more sense", meanwhile, is debatable. >
I don't have much of an opinion either way. But while correctness is debatable, is there any reason someone would prefer getting multiple (potentially very man) NaNs? I admit it feels more like a trap to me. About Performance ----------------- Note that at least on CPUs there is no reason why there even is an overhead [1]! (We simply miss an `np.islessgreater` function.) The overhead is a bit annoying admittedly, although pretty typical for NumPy, I guess. This one seems to be reducible to around 10% if you use `a != a` instead of `isnan` (the ufunc incurs a lot of overhead). Which is on par with 1.19.x, since NumPy got faster ;). About Reverting --------------- It feels like this should have a compat release note, which seems like a very strong flag for "not just a bug fix". But maybe there is a good reason to break with that? Cheers, Sebastian [1] C99 has `islessgreater` and both SSE2 and AVX have intrinsics which allow reformulating the whole thing to have exactly the same speed as before, these intrinsics effectively calculate things like: !(a < b) or !(a != b) Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a problem that we already have and is even now not really avoidable). So at least for all relevant CPUs, the implementation detail that we currently don't have `islessgreater`. (I have not found information of whether the C99 `islessgreater` has decent performance on CUDA) (Not surprisingly, MSVC is arguably buggy and does to not compile the C99 `islessgreate` to fast byte-code – although plausibly link-time optimizer may safe that. But since we have hand coded SSE2/AVX versions of comparisons, even that would probably not matter) > On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote: > > > > > > On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gomm...@gmail.com> > > wrote: > > > > > > > > > On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg > > > <sebast...@sipsolutions.net> wrote: > > > > Hi all, > > > > > > > > In NumPy 1.21, the output of `np.unique` changed in the presence > > > > of > > > > multiple NaNs. Previously, all NaNs were returned when we now > > > > only > > > > return one (all NaNs were considered unique): > > > > > > > > a = np.array([1, 1, np.nan, np.nan, np.nan]) > > > > > > > > Before 1.21: > > > > > > > > >>> np.unique(a) > > > > array([ 1., nan, nan, nan]) > > > > > > > > After 1.21: > > > > > > > > array([ 1., nan]) > > > > > > > > > > > > This change was requested in an old issue: > > > > > > > > https://github.com/numpy/numpy/issues/2111 > > > > > > > > And happened here: > > > > > > > > https://github.com/numpy/numpy/pull/18070 > > > > > > > > While, it has a release note. I am not sure the change got the > > > > attention it deserved. This would be especially worrying if it > > > > is a > > > > regression for anyone? > > > > > > I think it's now the expected answer, not a regression. `unique` is > > > not an elementwise function that needs to adhere to IEEE-754 where > > > nan != nan. I can't remember reviewing this change, but it makes > > > perfect sense to me. > > > > It turns out there's a lot more to this story. The short summary of > > it is that: > > - use case wise it's still true that considering NaN's equal makes > > more sense. > > - scikit-learn has a custom implementation which removes duplicate > > NaN's: > > https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7 > > - all other array/tensor libraries match NumPy's old behavior (return > > multiple NaN's) > > - there's a performance and implementation cost for CuPy et al. to > > match NumPy's changed behavior (multiple NaN's is the natural result, > > no extra checks needed) > > - There is also a significant performance cost for NumPy, ~25% for > > small/medium-size arrays with no/few NaN's (say the benchmarks in > > https://github.com/numpy/numpy/pull/18070 - which is a common case, > > and that's not negligible like the PR description claims. > > - the "single NaN" behavior is easy to get as a utility function on > > top of the "multiple NaN' (like scikit-learn does), the opposite is > > much harder > > - for the array API standard, the final decision was to go with the > > "multiple NaN' behavior, so we'll need that in `numpy.array_api`. > > - more discussion in > > https://github.com/data-apis/array-api/issues/249 > > > > It would be good to make up our minds before 1.22.0. Two choices: > > 1. We can leave it as it is now and work around the issue in > > `numpy.array_api`. It would also require CuPy and others to make > > changes, which probably cost performance. > > 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a > > release will be made. > > > > There is something to say for either option. I'd personally lean > > slightly towards reverting because of both consistency with other > > implementations and performance. But it seems like there's no optimal > > solution here, it's a fairly hairy issue. > > > > Thoughts? > > > > Ralf > > > > _______________________________________________ > > NumPy-Discussion mailing list -- numpy-discussion@python.org > > To unsubscribe send an email to numpy-discussion-le...@python.org > > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > > Member address: j...@fastmail.com > > > _______________________________________________ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: sebast...@sipsolutions.net
signature.asc
Description: This is a digitally signed message part
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com