Hi all,

just a note that I merged the PR with the following semantics:

A new `kind` keyword-only argument:
* `kind=None` uses a heuristic based on a memory bound to decide
  which method to use
* `kind="table"` uses the new approach (integer arrays only)
* `kind="sort"` forces the old behavior

The new documentation is available at:
https://numpy.org/devdocs/reference/generated/numpy.in1d.html

It seems this addition should be useful in many cases, but if you have
any concerns about the choice of API, please comment!

Cheers,

Sebastian


On Thu, 2022-06-16 at 06:08 -0700, Sebastian Berg wrote:
> Hi all,
> 
> there is a PR to add a faster path to `np.isin` that uses a lookup
> table for all the elements that are included in the haystack
> (`test_elements`):
> 
>     https://github.com/numpy/numpy/pull/12065/files
> 
> Such a table means that the memory overhead can be very significant,
> but the speedup as well, so there was the idea of adding an option to
> pick which version is used.
> 
> The current documentation for this new `method` keyword argument is
> quoted at the end of this mail.  The main questions are:
> 
> * Is there any concern about adding such a new kwarg?
> * Is `method` the best name?  Sorting uses `kind`, which may also be
>   good.
> 
> There is also the smaller question of what heuristic 'auto' would
> use, but that can be tweaked at any time.
> 
> ```
>    method : {'auto', 'sort', 'dictionary'}, optional
>          The algorithm to use. This will not affect the final result,
>          but will affect the speed. Default is 'auto'.
> 
>          - If 'sort', will use a mergesort-based approach. This will
>            have a memory usage of roughly 6 times the sum of the sizes
>            of `ar1` and `ar2`, not accounting for size of dtypes.
>          - If 'dictionary', will use a key-dictionary approach similar
>            to a counting sort. This is only available for boolean and
>            integer arrays. This will have a memory usage of the
>            size of `ar1` plus the max-min value of `ar2`. This tends
>            to be the faster method if the following formula is true:
>            `log10(len(ar2)) > (log10(max(ar2)-min(ar2)) - 2.27) / 0.927`,
>            but may use greater memory.
>          - If 'auto', will automatically choose the method which is
>            expected to perform the fastest, using the above
>            formula. For larger sizes or smaller range,
>            'dictionary' is chosen. For larger range or smaller
>            sizes, 'sort' is chosen.
> ```
> 
> Cheers,
> 
> Sebastian
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: sebast...@sipsolutions.net
