[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085437#comment-16085437
 ] 

Michael McCandless commented on LUCENE-7905:
--------------------------------------------

bq. You could do it with radix-like sort, but you'd need all the terms from all 
the TermEnums; the pq has the advantage of scanning through them on the fly, 
progressively?

Exactly!  I want a combined algorithm, that uses the "streaming" that the PQ 
gives us, and uses the "don't redundantly compare common prefixes over and over 
again" that radix sort gives us, because if you think about the comparisons 
that PQ is doing to maintain its heap structure, many of them are wasted on 
common prefixes.  The hard part is efficiently computing "all entries in the PQ 
now share a common prefix of 6" as heap entries are updated over time, but 
surely there is a way.

Though I suppose it's gains would be limited in this usage (merge sorting terms 
from all segments), because the small/tiny segments would mess up the common 
prefixes, i.e. the common prefix would often be low or 0 because of them.  But 
if you merge sorted equal sized segments, e.g. what happens when merging 
segments, then it could be powerful.

> Optimizations for OrdinalMap
> ----------------------------
>
>                 Key: LUCENE-7905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7905
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 7.1
>
>         Attachments: LUCENE-7905.patch, LUCENE-7905.patch, 
> LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to