[ https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057997#comment-17057997 ]
Jim Ferenczi commented on LUCENE-8929: -------------------------------------- Thanks [~sokolov], sorry for the unstructured answer but we touch a lot of interesting topics here ;) > By the way, I did also run a test using luceneutil's "modification timestamp" >field as the index sort and saw similar gains. I think that field is more >tightly correlated with insertion order, and also has much higher cardinality, >so it makes a good counterpoint: I'll post results here later once I can do a >workup. I think this is expected since you can "early terminate" running segments with the results of other running ones. Although my point was more that you could achieve similar (even superior) gain if you just sort the segments beforehand. It could even be not needed to search the segments concurrently if the insertion order matches the sort order. In such a sorted sequential search over segments can be more competitive. > I would have thought merges would interfere with that opto, but I guess for >the most part it works out? The natural order is preserved in tiered merge policy as well so I don't think it's an issue. > I'm not sure what to say about the `LeafFieldComparator` idea - it sounds >powerful, but I am also a bit leery of these complex Comparators - they make >other things more difficult since it becomes challenging to reason about the >sort order "from the outside". I had to resort to some "instanceof" hackery to >restrict consideration to cases where the comparator is numeric, and >extracting the sort value from the comparator is pretty messy too. We pay a >complexity cost here to handle some edge cases of more abstract comparators. Yes, the main intent is not to handle concurrent requests. The best example here is a LongSortField that can skip documents during the collections by comparing the current bottom value with the values indexed in the BKD-tree. It would be easier to implement such optimizations directly in the FieldComparator since it's an abstraction of a queue. > I hear your concern about the non-determinism due to tie-breaking, but I * >think* this is accounted for by including (global) docid in the comparison in >MaxScoreTerminator.LeafState? I may be missing something though. It doesn't >seem we have a good unit test checking for this tiebreak. I'll add to >TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is >covered Sorry it's just me that looked too quickly. This makes me wonder where the gains are coming from ? If we use all available threads from the beginning with no-ordering we're also "wasting" a lot of cpus. Having different strategies to sort the leaves and to set the number of threads could be dependent on the data and top N size for instance. In the case where the leaves don't intersect, it is preferable to run fewer segments at a time since we're expecting to skip the followers more efficiently if we start from the global min value. > Early Terminating CollectorManager > ---------------------------------- > > Key: LUCENE-8929 > URL: https://issues.apache.org/jira/browse/LUCENE-8929 > Project: Lucene - Core > Issue Type: Sub-task > Reporter: Atri Sharma > Priority: Major > Time Spent: 7h 20m > Remaining Estimate: 0h > > We should have an early terminating collector manager which accurately tracks > hits across all of its collectors and determines when there are enough hits, > allowing all the collectors to abort. > The options for the same are: > 1) Shared total count : Global "scoreboard" where all collectors update their > current hit count. At the end of each document's collection, collector checks > if N > threshold, and aborts if true > 2) State Reporting Collectors: Collectors report their total number of counts > collected periodically using a callback mechanism, and get a proceed or abort > decision. > 1) has the overhead of synchronization in the hot path, 2) can collect > unnecessary hits before aborting. > I am planning to work on 2), unless objections -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org