[jira] [Comment Edited] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

Dawid Weiss (JIRA) Tue, 24 May 2016 23:39:31 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299552#comment-15299552
 ]


Dawid Weiss edited comment on LUCENE-7299 at 5/25/16 6:38 AM:
--------------------------------------------------------------

We have an internal radix sort for sorting (largeish) key buffers (I attach it 
and hereby put it in the public domain). It was always a gain over anything 
else we tried.

As for common prefixes -- the nice part about it is that when you're descending 
into same prefix blocks you can disregard those prefixes in comparisons 
(knowing they have to be equal). We thus "pushed down" the comparison routine 
to byte block buffers -- see reference to compareAssumingEqualPrefix inside the 
code. There is also a hook inside byte block list to allow you to retrieve a 
single byte at a given offset so there's no need to copy keys over and over 
again (.byteAt).

A simple insertion sort at the "leaf" level was always better than trying to 
stop radix sort earlier. This result was observed was for realistic keys, not 
random distributions.

All these optimizations yielded significant speed boost for us, perhaps they'll 
be an inspiration of some sort, Adrien.



was (Author: dweiss):
We have an internal radix sort for sorting (largeish) key buffers (I attach it 
and hereby put it in the public domain). It was always a gain over anything 
else we tried.

As for common prefixes -- the nice part about it is that when you're descending 
into same prefix blocks you can disregard those prefixes in comparisons 
(knowing they have to be equal). We thus "pushed down" the comparison routine 
to byte block buffers -- see reference to compareAssumingEqualPrefix inside the 
code. There is also a hook inside byte block list to allow you to retrieve a 
single byte at a given offset so there's no need to copy keys over and over 
again (.byteAt).

A simple insertion sort at the "leaf" level was always better than trying to 
stop radix sort earlier. These was for realistic keys, not random distributions.

All these optimizations yielded significant speed boost for us, perhaps they'll 
be an inspiration of some sort, Adrien.


> BytesRefHash.sort() should use radix sort?
> ------------------------------------------
>
>                 Key: LUCENE-7299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7299
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: ByteBlockListSorter.java, LUCENE-7299.patch
>
>
> Switching DocIdSetBuilder to radix sort helped make things significantly 
> faster. We should be able to do the same with BytesRefHash.sort()?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

Reply via email to