[ https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299552#comment-15299552 ]
Dawid Weiss edited comment on LUCENE-7299 at 5/25/16 6:38 AM: -------------------------------------------------------------- We have an internal radix sort for sorting (largeish) key buffers (I attach it and hereby put it in the public domain). It was always a gain over anything else we tried. As for common prefixes -- the nice part about it is that when you're descending into same prefix blocks you can disregard those prefixes in comparisons (knowing they have to be equal). We thus "pushed down" the comparison routine to byte block buffers -- see reference to compareAssumingEqualPrefix inside the code. There is also a hook inside byte block list to allow you to retrieve a single byte at a given offset so there's no need to copy keys over and over again (.byteAt). A simple insertion sort at the "leaf" level was always better than trying to stop radix sort earlier. This result was observed was for realistic keys, not random distributions. All these optimizations yielded significant speed boost for us, perhaps they'll be an inspiration of some sort, Adrien. was (Author: dweiss): We have an internal radix sort for sorting (largeish) key buffers (I attach it and hereby put it in the public domain). It was always a gain over anything else we tried. As for common prefixes -- the nice part about it is that when you're descending into same prefix blocks you can disregard those prefixes in comparisons (knowing they have to be equal). We thus "pushed down" the comparison routine to byte block buffers -- see reference to compareAssumingEqualPrefix inside the code. There is also a hook inside byte block list to allow you to retrieve a single byte at a given offset so there's no need to copy keys over and over again (.byteAt). A simple insertion sort at the "leaf" level was always better than trying to stop radix sort earlier. These was for realistic keys, not random distributions. All these optimizations yielded significant speed boost for us, perhaps they'll be an inspiration of some sort, Adrien. > BytesRefHash.sort() should use radix sort? > ------------------------------------------ > > Key: LUCENE-7299 > URL: https://issues.apache.org/jira/browse/LUCENE-7299 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Assignee: Adrien Grand > Priority: Minor > Attachments: ByteBlockListSorter.java, LUCENE-7299.patch > > > Switching DocIdSetBuilder to radix sort helped make things significantly > faster. We should be able to do the same with BytesRefHash.sort()? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org