[ https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057967#comment-17057967 ]
Michael Sokolov commented on LUCENE-8929:
-----------------------------------------

Thanks for the insightful comments, [~jim.ferenczi], you've given me a lot to think about! I had not really considered sorting segments: that makes a lot of sense when documents are at least roughly inserted in sort order. I would have thought merges would interfere with that opto, but I guess for the most part it works out? The performance improvements you saw are stunning. It would be great if we could get the segment sorting ideas merged into the Lucene code base, no? I wonder how we determine when they are applicable though. In Elasticsearch is it done based on some a-priori knowledge, or do you analyze the distribution and turn on the opto automatically? That would be compelling I think.

On the other hand, the use case inspiring this does not tend to correlate index sort order and insertion order, so I don't think it would benefit as much from segment sorting (except due to chance, or in special cases), so I think these are really two separate optimizations and issues. We should be sure to structure the code in such a way that it can accommodate them all and properly choose which one to apply. We don't have a formal query planner in Lucene, but I guess we are beginning to evolve one.

I think the idea of splitting collectors is a good one, to avoid overmuch complexity in a single collector, but there is also a good deal of shared code across these. I can give that a try and see what it looks like.

By the way, I did also run a test using luceneutil's "modification timestamp" field as the index sort and saw similar gains. I think that field is more tightly correlated with insertion order, and also has much higher cardinality, so it makes a good counterpoint: I'll post results here later once I can do a workup.

I hear your concern about the non-determinism due to tie-breaking, but I *think* this is accounted for by including (global) docid in the comparison in MaxScoreTerminator.LeafState? I may be missing something though. It doesn't seem we have a good unit test checking for this tiebreak. I'll add to TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is covered.

I'm not sure what to say about the `LeafFieldComparator` idea - it sounds powerful, but I am also a bit leery of these complex Comparators - they make other things more difficult since it becomes challenging to reason about the sort order "from the outside". I had to resort to some "instanceof" hackery to restrict consideration to cases where the comparator is numeric, and extracting the sort value from the comparator is pretty messy too. We pay a complexity cost here to handle some edge cases of more abstract comparators.

> Early Terminating CollectorManager
> ----------------------------------
>
>                 Key: LUCENE-8929
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8929
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Atri Sharma
>            Priority: Major
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks
> hits across all of its collectors and determines when there are enough hits,
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their
> current hit count. At the end of each document's collection, collector checks
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts
> collected periodically using a callback mechanism, and get a proceed or abort
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections
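
For readers following along, the trade-off between options 1) and 2) in the quoted issue description can be illustrated with a small sketch in plain Java. It does not use Lucene's collector API and is not the code from the patch. The names `HitCollector`, `SharedCountCollector`, `Coordinator`, `ReportingCollector` and the `reportEvery` parameter are hypothetical, and the collectors are driven round-robin on one thread so the counts are deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: hypothetical types, not Lucene classes.
final class EarlyTerminationSketch {

    interface HitCollector {
        boolean collect();   // returns false when the collector should abort
        int collected();     // hits this collector has gathered so far
    }

    /** Option 1: every hit updates one shared counter (the "scoreboard"). */
    static final class SharedCountCollector implements HitCollector {
        private final AtomicInteger globalHits;
        private final int threshold;
        private int collected;

        SharedCountCollector(AtomicInteger globalHits, int threshold) {
            this.globalHits = globalHits;
            this.threshold = threshold;
        }

        public boolean collect() {
            collected++;
            // Shared state is touched on every hit; abort once N > threshold.
            return globalHits.incrementAndGet() <= threshold;
        }

        public int collected() { return collected; }
    }

    /** Option 2, callback target: takes batched counts, answers proceed or abort. */
    static final class Coordinator {
        private final AtomicInteger total = new AtomicInteger();
        private final int threshold;

        Coordinator(int threshold) { this.threshold = threshold; }

        boolean report(int newHits) {
            return total.addAndGet(newHits) <= threshold;
        }
    }

    /** Option 2, collector side: counts locally, reports every reportEvery hits. */
    static final class ReportingCollector implements HitCollector {
        private final Coordinator coordinator;
        private final int reportEvery;
        private int unreported;
        private int collected;

        ReportingCollector(Coordinator coordinator, int reportEvery) {
            this.coordinator = coordinator;
            this.reportEvery = reportEvery;
        }

        public boolean collect() {
            collected++;
            if (++unreported < reportEvery) {
                return true; // no shared state touched for this hit
            }
            int batch = unreported;
            unreported = 0;
            return coordinator.report(batch);
        }

        public int collected() { return collected; }
    }

    /** Feeds one hit at a time to each live collector; returns total hits collected. */
    static int roundRobin(HitCollector[] collectors, int hitsPerCollector) {
        boolean[] aborted = new boolean[collectors.length];
        for (int hit = 0; hit < hitsPerCollector; hit++) {
            for (int i = 0; i < collectors.length; i++) {
                if (!aborted[i]) {
                    aborted[i] = !collectors[i].collect();
                }
            }
        }
        int total = 0;
        for (HitCollector c : collectors) {
            total += c.collected();
        }
        return total;
    }

    static int runSharedCount(int collectors, int hitsPerCollector, int threshold) {
        AtomicInteger global = new AtomicInteger();
        HitCollector[] cs = new HitCollector[collectors];
        for (int i = 0; i < collectors; i++) {
            cs[i] = new SharedCountCollector(global, threshold);
        }
        return roundRobin(cs, hitsPerCollector);
    }

    static int runReporting(int collectors, int hitsPerCollector, int threshold, int reportEvery) {
        Coordinator coordinator = new Coordinator(threshold);
        HitCollector[] cs = new HitCollector[collectors];
        for (int i = 0; i < collectors; i++) {
            cs[i] = new ReportingCollector(coordinator, reportEvery);
        }
        return roundRobin(cs, hitsPerCollector);
    }

    public static void main(String[] args) {
        System.out.println(runSharedCount(2, 100, 10));  // prints 12
        System.out.println(runReporting(2, 100, 10, 4)); // prints 16
    }
}
```

With two collectors and a threshold of 10, the shared counter stops after 12 hits but is updated on every one of them, while the reporting variant touches shared state only four times and overshoots to 16 hits. That is the cost each option pays, in miniature.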