Hi all,

just a note that I merged the PR with the following semantics:

A new `kind` keyword-only argument:

* `kind=None` uses a heuristic based on a memory bound to decide which
  method to use
* `kind="table"` uses the new approach (integer arrays only)
* `kind="sort"` forces the old behavior

The new documentation is available at:

    https://numpy.org/devdocs/reference/generated/numpy.in1d.html

It seems this addition should be useful in many cases, but if you have
any concern about the choice of API please comment!

Cheers,

Sebastian


On Thu, 2022-06-16 at 06:08 -0700, Sebastian Berg wrote:
> Hi all,
>
> there is a PR to add a faster path to `np.isin`, that uses a look-up
> table for all the elements that are included in the haystack
> (`test_elements`):
>
>     https://github.com/numpy/numpy/pull/12065/files
>
> Such a table means that the memory overhead can be very significant,
> but the speedup as well, so there was the idea of adding an option to
> pick which version is used.
>
> The current documentation for this new `method` keyword argument
> would be as given below. So the main questions are:
>
> * Is there any concern about adding such a new kwarg?
> * Is `method` the best name? Sorts uses `kind` which may also be
>   good
>
> There is also the smaller question of what heuristic 'auto' would
> use, but that can be tweaked at any time.
>
> ```
>     method : {'auto', 'sort', 'dictionary'}, optional
>         The algorithm to use. This will not affect the final result,
>         but will affect the speed. Default is 'auto'.
>
>         - If 'sort', will use a mergesort-based approach. This will
>           have a memory usage of roughly 6 times the sum of the sizes
>           of `ar1` and `ar2`, not accounting for size of dtypes.
>         - If 'dictionary', will use a key-dictionary approach similar
>           to a counting sort. This is only available for boolean and
>           integer arrays. This will have a memory usage of the
>           size of `ar1` plus the max-min value of `ar2`. This tends
>           to be the faster method if the following formula is true:
>           `log10(len(ar2)) > (log10(max(ar2)-min(ar2)) - 2.27) / 0.927`,
>           but may use greater memory.
>         - If 'auto', will automatically choose the method which is
>           expected to perform the fastest, using the above
>           formula. For larger sizes or smaller range,
>           'dictionary' is chosen. For larger range or smaller
>           sizes, 'sort' is chosen.
> ```
>
> Cheers,
>
> Sebastian
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: sebast...@sipsolutions.net
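
For readers who want a feel for the trade-off described in the quoted docstring, here is a rough pure-Python sketch. It is not the code from the PR: `isin_table` and `table_expected_faster` are hypothetical names made up for illustration, the table is a plain list rather than a NumPy array, and the inequality is only the one quoted in the proposed docstring (the merged `kind=None` heuristic is described as memory-bound based and may well differ).

```python
import math


def isin_table(ar1, ar2):
    """Hypothetical sketch of a look-up-table membership test (integers only)."""
    if not ar2:
        return [False] * len(ar1)
    lo, hi = min(ar2), max(ar2)
    # One slot per integer in [lo, hi]: this is where the
    # "max-min value of ar2" memory cost comes from.
    table = [False] * (hi - lo + 1)
    for v in ar2:
        table[v - lo] = True
    # Values outside [lo, hi] cannot be in ar2, so skip the table for them.
    return [lo <= v <= hi and table[v - lo] for v in ar1]


def table_expected_faster(ar2):
    """Evaluate the inequality quoted in the proposed docstring.

    Assumes ar2 is a non-empty sequence of integers.
    """
    value_range = max(ar2) - min(ar2)
    if value_range == 0:
        # log10(0) is undefined; a one-slot table is trivially cheap.
        return True
    return math.log10(len(ar2)) > (math.log10(value_range) - 2.27) / 0.927


print(isin_table([1, 5, 7, 100], [5, 7, 9]))
print(table_expected_faster(list(range(1000))))  # dense values, small range
print(table_expected_faster([0, 10**9]))         # two values, huge range
```

With the merged API, the table path would presumably be requested as `np.isin(ar1, ar2, kind="table")` and the old path as `kind="sort"`; leaving `kind=None` lets NumPy decide.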