[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057967#comment-17057967
 ] 

Michael Sokolov commented on LUCENE-8929:
-----------------------------------------

Thanks for the insightful comments, [~jim.ferenczi], you've given me a lot to 
think about! I had not really considered sorting segments: that makes a lot of 
sense when documents are at least roughly inserted in sort order. I would have 
thought merges would interfere with that opto, but I guess for the most part it 
works out? The performance improvements you saw are stunning. It would be great 
if we could get the segment sorting ideas merged into the Lucene code base, no? 
I wonder how we determine when they are applicable though. In Elasticsearch is 
it done based on some a-priori knowledge, or do you analyze the distribution 
and turn on the opto automatically? That would be compelling I think. On the 
other hand, the use case inspiring this does not tend to correlate index sort 
order and insertion order, so I don't think it would benefit as much from 
segment sorting (except due to chance, or in special cases), so I think these 
are really two separate optimizations and issues. We should be sure to 
structure the code in a way that can accommodate them all and properly 
choose which one to apply. We don't have a formal query planner in Lucene, but 
I guess we are beginning to evolve one.

I think the idea of splitting collectors is a good one, to avoid overmuch 
complexity in a single collector, but there is also a good deal of shared code 
across these. I can give that a try and see what it looks like. 

By the way, I did also run a test using luceneutil's "modification timestamp" 
field as the index sort and saw similar gains. I think that field is more 
tightly correlated with insertion order, and also has much higher cardinality, 
so it makes a good counterpoint: I'll post results here later once I can do a 
workup.

I hear your concern about the non-determinism due to tie-breaking, but I 
*think* this is accounted for by including (global) docid in the comparison in 
MaxScoreTerminator.LeafState? I may be missing something though. It doesn't 
seem we have a good unit test checking for this tiebreak. I'll add to 
TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is 
covered.
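
To make the tie-break concrete, here is a minimal sketch of the idea, not the 
actual MaxScoreTerminator.LeafState code. `Hit`, `TieBreakSketch` and 
`topDocs` are hypothetical names invented for illustration, and the sort 
value is assumed to be a single long. The point is only that once the global 
docid (docBase + segment-local doc) is the last comparison key, two hits can 
never compare as equal, so the top N should not depend on the order in which 
segments report their hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a collected hit; this is not a Lucene class.
final class Hit {
    final long sortValue;  // value of the sort field (assumed numeric)
    final int globalDoc;   // docBase + segment-local doc id

    Hit(long sortValue, int docBase, int localDoc) {
        this.sortValue = sortValue;
        this.globalDoc = docBase + localDoc;
    }
}

final class TieBreakSketch {
    // Compare by sort value first, then by global doc id, so that hits
    // with equal sort values still have a total, deterministic order.
    static final Comparator<Hit> BY_VALUE_THEN_DOC =
        Comparator.<Hit>comparingLong(h -> h.sortValue)
                  .thenComparingInt(h -> h.globalDoc);

    // Returns the global doc ids of the best n hits under that order.
    static int[] topDocs(List<Hit> hits, int n) {
        return hits.stream()
                   .sorted(BY_VALUE_THEN_DOC)
                   .limit(n)
                   .mapToInt(h -> h.globalDoc)
                   .toArray();
    }
}
```

Feeding the same hits in a different arrival order (as could happen when 
segments are searched concurrently) gives the same result, which is the 
property a unit test for the tiebreak would want to assert.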

I'm not sure what to say about the `LeafFieldComparator` idea - it sounds 
powerful, but I am also a bit leery of these complex Comparators - they make 
other things more difficult since it becomes challenging to reason about the 
sort order "from the outside". I had to resort to some "instanceof" hackery to 
restrict consideration to cases where the comparator is numeric, and extracting 
the sort value from the comparator is pretty messy too. We pay a complexity 
cost here to handle some edge cases of more abstract comparators.  

> Early Terminating CollectorManager
> ----------------------------------
>
>                 Key: LUCENE-8929
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8929
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Atri Sharma
>            Priority: Major
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless there are objections
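
The trade-off between the two options in the description can be sketched 
roughly as follows. This is an illustration only, under the assumption that 
a single shared counter is acceptable; `HitCoordinator` and 
`ReportingCollector` are hypothetical names, not Lucene classes. With a 
reporting interval of 1 every collected document touches the shared counter 
(option 1); with a larger interval the collectors synchronize less often but 
may collect extra hits before they learn they should abort (option 2).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical coordinator shared by all collectors; not a Lucene class.
final class HitCoordinator {
    private final AtomicLong totalHits = new AtomicLong();
    private final long threshold;

    HitCoordinator(long threshold) {
        this.threshold = threshold;
    }

    // Callback: adds a batch of hits to the shared total and returns
    // true to proceed, or false once the total exceeds the threshold.
    boolean report(long newHits) {
        return totalHits.addAndGet(newHits) <= threshold;
    }

    long total() {
        return totalHits.get();
    }
}

// Hypothetical per-thread collector that reports once every `interval` hits.
final class ReportingCollector {
    private final HitCoordinator coordinator;
    private final int interval;
    private long unreported;   // hits collected since the last report
    private boolean aborted;

    ReportingCollector(HitCoordinator coordinator, int interval) {
        this.coordinator = coordinator;
        this.interval = interval;
    }

    // Returns false once the coordinator has told this collector to stop.
    boolean collect(int doc) {
        if (aborted) {
            return false;
        }
        unreported++;
        if (unreported >= interval) {
            // Only here do we touch shared state, keeping it off the
            // per-document path when interval > 1.
            aborted = !coordinator.report(unreported);
            unreported = 0;
        }
        return !aborted;
    }
}
```

Note that in this sketch a collector that has not yet reached its reporting 
interval keeps collecting even after another collector has pushed the total 
past the threshold, which is the "unnecessary hits" cost the description 
mentions for option 2.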



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

