Re: [Numpy-discussion] quickselect

Julian Taylor Wed, 29 May 2013 09:26:44 -0700

On 05/29/2013 06:12 AM, josef.p...@gmail.com wrote:
> On Tue, May 28, 2013 at 6:31 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
>> Hi All,
>>
>> There is a PR adding quickselect to numpy as a function `partition`.
>> Comments on name and exposure in the numpy API are welcome.
>
> I think the name is fine. It's possible to get used to it.
>
> One possible use I can think of is if I want to sort afterwards only
> one of the tails (like largest 15%).
>
> Limitations:
>
> medium: like percentiles it would be nice to partition into more than
> two sets, if that can be done more efficiently than calling the
> function twice.


I'm not aware of an algorithm that is more efficient to generate 
multiple partitions opposed to partitioning twice.
Additional partitions can work on subranges so so smaller the remainign 
range the faster it gets.
I did a coarse benchmark with percentile partitioning multiple times 
instead of sorting once.
partition beats sorting even when you want to select 500 percentiles, so 
we probably can change it completely.
If people frequently select even more we could add a heuristic cut off 
to sort instead.

>
> major:
> it won't work to calculate the median or other quantiles since those 
> interpolate
> np.median: "When N is even, it is the average of the two middle values
> of V_sorted"
>

this works fine, quickselect/partition moves the kth element into its 
final sorted order, so to get the median of even elements you can just 
partition twice, to get the two middle values. Then take the mean as usual.

This is already implemented in the pull request.
It includes the additional optimization that the second middle value is 
just a minimum of the second partition.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] quickselect

Reply via email to