[ 
https://issues.apache.org/jira/browse/ARROW-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359990#comment-17359990
 ] 

Kirill Lykov edited comment on ARROW-10899 at 6/9/21, 11:31 AM:
----------------------------------------------------------------

I have to drop this ticket but if someone will take it, I will be glad to 
assist.
Below is a summary of the research I did.

First of all I couldn't find ready to use stable radix sort implementation 
which is fast and generic.
boost::spread_sort is the best available implementation of in-place radix sort 
because it designed to perform well on all the inputs.
On the contrary, vanilla radix sort will perform poorly on some types of 
distributions as shown on the plot below.
spread_sort performs better for two reasons:
* it detect some cases when radix sort will perform poorly and relies on 
boost's pdqsort sorting algorithm
* it uses adaptive size of the radix rather than fixed

What makes spreadsort unstable is that it sometimes relies on pdqsort which is 
unstable and that it is in-place sort.
Hence, it is possible to make it stable by replacing calls to pdqsort to 
stable_sort and by using additional buffer instead of doing in-place sorting.
Another argument for starting from spread sort code is that it is tested on 
different platforms (so less risk for surprises), it works with different 
datatypes.

!differentdistrib.png|width=350,height=350!

Beside of developing a new efficient implementation of adaptive radix sort, 
there is also work need to be done to integrate this sort. Since right now a 
custom comparator is used with std::stable_sort.

Having in mind also extensive testing/bmk (different distributions, different 
data types) and profiling, I would estimate this task for 80-120h


was (Author: klykov):
I have to drop this ticket but if someone will take it, I will be glad to 
assist.
Below is a summary of the research I did.

First of all I couldn't find ready to use stable radix sort implementation 
which is fast and generic.
boost::spread_sort is the best available implementation of in-place radix sort 
because it designed to perform well on all the inputs.
On the contrary, vanilla radix sort will perform poorly on some types of 
distributions as shown on the plot below.
spread_sort performs better for two reasons:
* it detect some cases when radix sort will perform poorly and relies on 
boost's pdqsort sorting algorithm
* it uses adaptive size of the radix rather than fixed

What makes spreadsort unstable is that it sometimes relies on pdqsort which is 
unstable and that it is in-place sort.
Hence, it is possible to make it stable by replacing calls to pdqsort to 
stable_sort and by using additional buffer instead of doing in-place sorting.

!differentdistrib.png|width=350,height=350!

> [C++] Investigate radix sort for integer arrays
> -----------------------------------------------
>
>                 Key: ARROW-10899
>                 URL: https://issues.apache.org/jira/browse/ARROW-10899
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Kirill Lykov
>            Priority: Major
>         Attachments: Screen Shot 2021-02-09 at 17.48.13.png, Screen Shot 
> 2021-02-10 at 10.58.23.png, all_random_wholeRange.png, 
> all_random_wholeRange.png, all_random_wholeRange.png, differentdistrib.png
>
>
> For integer arrays with a non-tiny range of values, we currently use a stable 
> sort. It may be faster to use a radix sort instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to