[ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203829#comment-16203829 ]
Adrien Grand commented on LUCENE-4100: -------------------------------------- bq. indexsearcher already knows if scores are needed (e.g. from the Sort), but there is no way to tell it that approximate total hit count is acceptable. If we can do that, then I think we can make the early termination case really easy for the sorted case, index order case, and also this maxscore case. +1 We need to add a new {{boolean needsTotalHits}} to the {{search(Query, int)}} and {{search(Query, int, Sort)}} methods. bq. we do? [...] I'm just asking because the new scorer here looks a hell of a lot like a disjunction scorer Well it is a disjunction. :) Our regular disjunction scorer maintains a single heap and only looks at the 'top' element and callso {{updateTop}} after calls to {{nextDoc}} or {{advance}}. If you want to be able to call {{advance()}} sometimes on low-scoring clauses even if only {{nextDoc()}} was called on the disjunction, you need to give it the ability to leave some scorers behind, as long as the sum of the max scores of scorers that are behind and scorers that are positioned on the current candidate is not greater than the minimum competitive score (otherwise you might be missing matches). This means you need to move scorers between at least 2 data-structures. In practice, we actually use 3 of them so that we can also easily differenciate scorers that are on the current candidate from scorers that are too advanced. Moving scorers between those data-structures has some overhead. For instance here is what I get when I benchmark the MaxScoreScorer against master's DisjunctionSumScorer (BS1 is disabled in both cases): {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff HighTermDayOfYearSort 33.34 (7.3%) 16.55 (2.6%) -50.4% ( -56% - -43%) OrHighHigh 21.91 (4.1%) 12.36 (1.7%) -43.6% ( -47% - -39%) OrHighLow 48.08 (4.0%) 29.27 (2.4%) -39.1% ( -43% - -34%) OrHighMed 60.66 (4.0%) 39.32 (2.5%) -35.2% ( -40% - -29%) Fuzzy2 117.30 (7.1%) 101.36 (7.5%) -13.6% ( -26% - 1%) Fuzzy1 238.83 (10.7%) 212.80 (7.3%) -10.9% ( -26% - 7%) OrHighNotLow 70.64 (3.4%) 66.82 (2.0%) -5.4% ( -10% - 0%) OrHighNotMed 65.59 (3.0%) 62.63 (1.6%) -4.5% ( -8% - 0%) OrNotHighHigh 34.55 (2.3%) 33.24 (1.1%) -3.8% ( -6% - 0%) OrHighNotHigh 48.98 (2.4%) 47.17 (1.4%) -3.7% ( -7% - 0%) OrNotHighMed 107.24 (2.1%) 103.40 (1.1%) -3.6% ( -6% - 0%) IntNRQ 26.26 (11.5%) 25.64 (12.3%) -2.3% ( -23% - 24%) AndHighLow 879.33 (3.7%) 860.16 (3.3%) -2.2% ( -8% - 4%) AndHighMed 168.90 (1.7%) 165.48 (1.3%) -2.0% ( -4% - 1%) HighPhrase 20.08 (2.8%) 19.69 (2.7%) -2.0% ( -7% - 3%) MedSloppyPhrase 15.44 (1.7%) 15.15 (1.8%) -1.9% ( -5% - 1%) LowSpanNear 44.70 (2.1%) 43.88 (1.9%) -1.8% ( -5% - 2%) MedPhrase 52.07 (3.2%) 51.16 (3.1%) -1.7% ( -7% - 4%) LowSloppyPhrase 150.90 (1.5%) 148.62 (1.5%) -1.5% ( -4% - 1%) OrNotHighLow 1174.47 (3.6%) 1157.52 (4.1%) -1.4% ( -8% - 6%) HighSpanNear 38.46 (3.0%) 37.92 (2.4%) -1.4% ( -6% - 4%) HighSloppyPhrase 28.13 (2.4%) 27.76 (2.5%) -1.3% ( -6% - 3%) LowPhrase 131.67 (1.6%) 130.37 (2.0%) -1.0% ( -4% - 2%) MedSpanNear 54.75 (3.4%) 54.22 (3.0%) -1.0% ( -7% - 5%) Wildcard 41.58 (5.1%) 41.23 (5.2%) -0.8% ( -10% - 9%) AndHighHigh 71.08 (1.5%) 70.61 (1.3%) -0.7% ( -3% - 2%) Prefix3 88.11 (6.4%) 87.57 (7.5%) -0.6% ( -13% - 14%) MedTerm 235.73 (1.4%) 235.35 (3.7%) -0.2% ( -5% - 4%) HighTerm 144.06 (1.3%) 143.93 (4.2%) -0.1% ( -5% - 5%) LowTerm 711.49 (4.2%) 718.15 (4.7%) 0.9% ( -7% - 10%) Respell 203.69 (8.6%) 206.72 (7.0%) 1.5% ( -12% - 18%) HighTermMonthSort 196.30 (5.2%) 222.54 (8.2%) 13.4% ( 0% - 28%) {noformat} bq. I do like that in your patch AssertingScorer checks all that stuff, but there it is a bit confusing. Can you elaborate on what you find confusing? This looks similar to how you should not call {{Score.score()}} if you passed {{needsScores=false}} to me? > Maxscore - Efficient Scoring > ---------------------------- > > Key: LUCENE-4100 > URL: https://issues.apache.org/jira/browse/LUCENE-4100 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs, core/query/scoring, core/search > Affects Versions: 4.0-ALPHA > Reporter: Stefan Pohl > Labels: api-change, gsoc2014, patch, performance > Fix For: 4.9, 6.0 > > Attachments: LUCENE-4100.patch, LUCENE-4100.patch, > contrib_maxscore.tgz, maxscore.patch > > > At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient > algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, > that I find deserves more attention among Lucene users (and developers). > I implemented a proof of concept and did some performance measurements with > example queries and lucenebench, the package of Mike McCandless, resulting in > very significant speedups. > This ticket is to get started the discussion on including the implementation > into Lucene's codebase. Because the technique requires awareness about it > from the Lucene user/developer, it seems best to become a contrib/module > package so that it consciously can be chosen to be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org