[jira] [Commented] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

Adrien Grand (JIRA) Tue, 24 May 2016 15:02:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299015#comment-15299015
 ]


Adrien Grand commented on LUCENE-7299:
--------------------------------------

Thanks Mike for testing. The flush times look better indeed!

bq. Might you give us some advise on when to use this sort implementation?

I think radix sort is usually appealing as the complexity degrades more 
gracefully when the number of entries to sort increases. For keys that have a 
maximum length of k, its complexity is O(n*k) while a comparison-based sort has 
a complexity of O(n*k*log(n)). So this change should make things even better 
with larger ram buffers (in addition to the fact that larger ram buffers mean 
less merging).

Like Yonik mentioned, an adversary case is when there are long common prefixes 
(because of the k parameter). The implementation has some protection against it 
though by forcing the fall back to intro sort after a given number of levels of 
recursion (currently 19, needs to be tuned) and recursing directly when all 
values fall into the same bucket for the k-th byte. Note that comparison-based 
sorts have the same adversary case since the comparisons need to scan these 
common prefixes as well, but it is less annoying in practice for 
BytesRefHash.sort() since the bottleneck is not the comparisons of values but 
'get'ting the values to compare from the BytesRefHash (because of the random 
access pattern). Radix sort needs to call 'get' about 2*n*k times while 
introsort needs to call it about n*log(n) times, which is why this fall back to 
introsort is still useful.

By the way, I forgot to mention an important implementation detail of this 
impl: since the bottleneck is to get values from the BytesRefHash, I added a 
cache for the first 3 buckets. This helps perform the first 3 levels of 
recursion by getting every value once instead of 2*3=6 when operating normally. 
I measured that this cache accounts for about 1/3 of the speedup, but it has 
the drawback of requiring an int[] of size n. I think it is fine though since 
this cache should be much smaller than the BytesRefHash itself.

bq. Would it make sense for an implementation to extend o.a.l.u.Sorter

I can try but it could not be as generic as the current Sorter impls that 
mostly need a comparison function. To keep it efficient we would probably have 
to enforce a BytesRef representation of the keys.

> BytesRefHash.sort() should use radix sort?
> ------------------------------------------
>
>                 Key: LUCENE-7299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7299
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7299.patch
>
>
> Switching DocIdSetBuilder to radix sort helped make things significantly 
> faster. We should be able to do the same with BytesRefHash.sort()?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

Reply via email to