[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

David Menéndez Hurtado Tue, 28 Jun 2022 10:18:46 -0700

On Tue, 28 Jun 2022, 6:50 pm Ralf Gommers, <ralf.gomm...@gmail.com> wrote:


>
>
>> ```
>>     kind : {None, 'sort', 'table'}, optional
>>
>
> Regarding the name, `'table'` is an implementation detail. The end user
> should not have to care what the data structure is that is used. I suggest
> to use something like "unsorted" and just explain it as the ordering of
> results being undefined, which can give significant performance benefits.
>

But that suggests that if I wanted them sorted, I should use the "sorted"
kind, even though it is probably faster to do a table unique and then sort
the results.

From the user's point of view there are two concerns. One is whether the
results are sorted, and the other is memory usage. I suggest adding two
boolean flags, "low_memory" (following sklearn and, I think, scipy too)
and "sorted". Depending on the algorithm, "sorted=True" will either perform
a sort or do nothing, because the output is already ordered.
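
To make the two-flag idea concrete, here is a minimal sketch. The name
`unique_sketch` is hypothetical and this is not NumPy API; the hash-based
branch is only a stand-in for "some algorithm with undefined output order",
not the lookup table from the PR.

```python
import numpy as np


def unique_sketch(ar, low_memory=False, sorted=True):
    """Hypothetical wrapper illustrating the two proposed flags.

    Note: the parameter name `sorted` follows the proposal and shadows the
    builtin inside this function, so the builtin is not used here.
    """
    ar = np.asarray(ar).ravel()

    if low_memory:
        # Sort-based path: np.unique returns sorted values as a side effect
        # of the algorithm, so `sorted=True` has nothing left to do.
        return np.unique(ar)

    # Hash-based stand-in for an unordered algorithm. A dict keeps
    # first-occurrence order, which is not sorted order in general.
    seen = dict.fromkeys(ar.tolist())
    out = np.array(list(seen), dtype=ar.dtype)

    if sorted:
        # Only the unique values are sorted, not the whole input.
        out.sort()
    return out
```

With `data = [3, 1, 2, 3, 1]`, `unique_sketch(data)` gives `[1, 2, 3]`,
`unique_sketch(data, sorted=False)` gives `[3, 1, 2]`, and with
`low_memory=True` the result is `[1, 2, 3]` whatever `sorted` is set to.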

/David


> Cheers,
> Ralf
>
>>         The algorithm to use. This will not affect the final result,
>>         but will affect the speed and memory use. The default, None,
>>         will select automatically based on memory considerations.
>>
>>         * If 'sort', will use a mergesort-based approach.
>>         * If 'table', will use a lookup table approach similar
>>           to a counting sort. This is only available for boolean and
>>           integer arrays. This will have a memory usage of the
>>           size of `ar` plus the max-min value of `ar`. The options
>>           `return_index`, `return_inverse`, `axis`, and `equal_nan`
>>           are unavailable with this option.
>>         * If None, will automatically choose 'table' if possible,
>>           and the required memory allocation is less than or equal to
>>           6 times the size of `ar`. Will otherwise use 'sort'.
>>           This is done to not use a large amount of memory by default,
>>           even though 'table' may be faster in most cases.
>> ```
>> The method and API are very similar to that merged last week for `isin`:
>> https://github.com/numpy/numpy/pull/12065/. One difference is that
>> `return_counts` required a slightly modified approach; using `bincount`
>> seems to work well for this.
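
For readers following along, a rough sketch of a lookup-table unique with
counts via `bincount` might look like the code below. It is an illustration
under stated assumptions, not the code from the PR: the name `table_unique`
is hypothetical, only signed and unsigned integer input is handled, and
values are cast to int64 before shifting.

```python
import numpy as np


def table_unique(ar, return_counts=False):
    """Hypothetical lookup-table unique for integer arrays (sketch only)."""
    ar = np.asarray(ar).ravel()
    if ar.dtype.kind not in "iu":
        raise TypeError("this sketch only handles integer arrays")
    if ar.size == 0:
        empty = ar.copy()
        return (empty, np.zeros(0, dtype=np.intp)) if return_counts else empty

    lo = int(ar.min())
    # Shift so the smallest value maps to slot 0; the table then has
    # max - min + 1 slots, which is where the extra memory goes.
    shifted = ar.astype(np.int64) - lo
    counts = np.bincount(shifted)

    # Occupied slots are the unique values, already in ascending order.
    present = np.flatnonzero(counts)
    values = (present + lo).astype(ar.dtype)

    if return_counts:
        return values, counts[present]
    return values
```

For example, `table_unique([5, 3, 5, -1, 3, 5], return_counts=True)` returns
the values `[-1, 3, 5]` with counts `[1, 2, 3]`. The output is ordered here
because the table is indexed by value, as in a counting sort.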
>>
>> I am eager to hear your comments on this new PR.
>>
>> Thanks!
>> Miles
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: ralf.gomm...@googlemail.com
>>
