[
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4100:
---------------------------------
Attachment: LUCENE-4100.patch
I have been exploring how we could make this optimization less inconvenient
by not requiring static indices or codec-level changes, even though this might
make the optimization a bit less efficient. Here is a patch that demonstrates
the idea:
- {{SimScorer.maxScore()}} returns the maximum score that may be produced by
{{SimScorer.score()}}. The BM25 impl returns {{IDF * (k1 + 1)}}.
{{POSITIVE_INFINITY}} is a fine return value in case a SimScorer doesn't have
an upper bound on the produced scores.
- {{Scorer.maxScore()}} returns the maximum score that may be produced by
{{Scorer.score()}}. For {{TermScorer}}, this method delegates to
{{SimScorer.maxScore()}}.
- {{Scorer.setMinCompetitiveScore(float)}} is used to tell the scorer that
documents that produce scores that are less than the given score _may_ safely
be ignored. This method is called by TopScoreDocCollector whenever the top of
the priority queue is updated. Scorer impls are free to ignore it.
- There is a new {{MaxScoreScorer}}, which only works on pure disjunctions and
is largely inspired by {{MinShouldMatchSumScorer}}, except that instead of
ensuring that freq >= minShouldMatch, it tries to ensure that sum_of_max_scores
>= minCompetitiveScore.
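To make the interplay of these pieces concrete, here is a minimal sketch. The class and method names below are made up for illustration; only the {{IDF * (k1 + 1)}} bound and the sum-of-max-scores condition come from the patch description above.

```java
// Hypothetical sketch of the ideas above; not actual Lucene code.
class MaxScoreSketch {

  // BM25's score upper bound from the description: idf * (k1 + 1).
  // A SimScorer without a bound would return Float.POSITIVE_INFINITY instead.
  static float bm25MaxScore(float idf, float k1) {
    return idf * (k1 + 1);
  }

  // The MaxScoreScorer condition: a doc matched by some subset of the
  // disjuncts can only be competitive if the sum of that subset's max
  // scores reaches the minimum competitive score that the collector
  // reported via setMinCompetitiveScore(float).
  static boolean canBeCompetitive(float[] clauseMaxScores,
                                  float minCompetitiveScore) {
    float sum = 0;
    for (float maxScore : clauseMaxScores) {
      sum += maxScore;
    }
    return sum >= minCompetitiveScore;
  }
}
```

With this invariant, a disjunction can skip whole ranges of documents whose clause-level bounds cannot reach the current top of the priority queue, which is where the {{OrHigh*}} speedups come from.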
It is largely untested, but luceneutil finds the same top docs with and without
the patch, so I expect it not to be completely broken (I had to disable the
check that hit counts are the same, since Maxscore doesn't give total hit
counts). It yields interesting speedups on wikimedium10M, even for
{{OrHighHigh}}:
{noformat}
                 Task  QPS baseline  StdDev  QPS patch  StdDev    Pct diff
HighTermDayOfYearSort        102.75  (4.8%)      57.26  (3.7%)   -44.3% ( -50% -  -37%)
              LowTerm        708.14  (5.0%)     696.51  (5.4%)    -1.6% ( -11% -    9%)
           HighPhrase         68.06  (4.0%)      67.00  (3.7%)    -1.6% (  -8% -    6%)
         HighSpanNear          2.84  (3.4%)       2.80  (3.6%)    -1.3% (  -7% -    5%)
          MedSpanNear         24.12  (2.5%)      23.90  (2.6%)    -0.9% (  -5% -    4%)
            MedPhrase         49.14  (2.8%)      48.81  (2.8%)    -0.7% (  -6% -    5%)
          LowSpanNear         20.48  (2.8%)      20.36  (2.9%)    -0.6% (  -6% -    5%)
            LowPhrase         24.11  (2.6%)      24.01  (2.5%)    -0.4% (  -5% -    4%)
           AndHighMed        316.86  (2.7%)     315.66  (2.7%)    -0.4% (  -5% -    5%)
              MedTerm        334.25  (3.1%)     333.01  (3.2%)    -0.4% (  -6% -    6%)
              Respell        261.62  (3.4%)     260.86  (5.5%)    -0.3% (  -8% -    8%)
          AndHighHigh         62.76  (2.2%)      62.59  (2.4%)    -0.3% (  -4% -    4%)
             HighTerm         86.17  (3.9%)      86.24  (2.8%)     0.1% (  -6% -    6%)
               IntNRQ         15.03  (6.5%)      15.20  (5.2%)     1.1% (  -9% -   13%)
      LowSloppyPhrase         99.72  (3.3%)     100.83  (3.3%)     1.1% (  -5% -    8%)
              Prefix3         34.06  (7.2%)      34.48  (5.0%)     1.3% ( -10% -   14%)
             Wildcard        159.32  (8.4%)     161.37  (6.8%)     1.3% ( -12% -   18%)
      MedSloppyPhrase         42.17  (3.2%)      42.81  (3.0%)     1.5% (  -4% -    7%)
           AndHighLow       1059.03  (3.6%)    1075.73  (2.9%)     1.6% (  -4% -    8%)
         OrNotHighLow       1774.47  (4.1%)    1805.36  (4.7%)     1.7% (  -6% -   10%)
     HighSloppyPhrase         32.84  (4.0%)      33.46  (3.4%)     1.9% (  -5% -    9%)
         OrNotHighMed        295.65  (6.4%)     309.03  (4.7%)     4.5% (  -6% -   16%)
    HighTermMonthSort        227.22  (9.9%)     241.42  (8.7%)     6.2% ( -11% -   27%)
        OrNotHighHigh         43.19  (3.2%)      48.56  (3.1%)    12.4% (   5% -   19%)
         OrHighNotMed         93.61  (3.5%)     113.32  (4.8%)    21.1% (  12% -   30%)
        OrHighNotHigh         13.58  (3.2%)      16.44  (4.9%)    21.1% (  12% -   30%)
               Fuzzy1        190.81  (5.2%)     236.11 (14.4%)    23.7% (   3% -   45%)
         OrHighNotLow         68.12  (3.9%)      85.99  (6.1%)    26.2% (  15% -   37%)
           OrHighHigh         37.43  (5.4%)      51.96  (5.0%)    38.8% (  26% -   52%)
            OrHighMed         55.45  (5.2%)     130.78  (9.0%)   135.8% ( 115% -  158%)
               Fuzzy2         24.33  (4.6%)      81.51 (24.2%)   235.0% ( 197% -  276%)
            OrHighLow         32.37  (4.4%)     451.34 (47.8%)  1294.2% (1189% - 1408%)
{noformat}
TODO:
- docs, tests
- Add a {{boolean needsTotalHits}} to {{Query.createWeight}} and {{Collector}},
similar to the existing {{boolean needsScores}}.
- How should we deal with TopDocs instances that don't have correct total hit
counts? Set the count to {{-1}}? Or maybe {{-1 - num_collected_docs}} so that
users can get a lower bound on the number of matches?
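For the second option, the proposed encoding could look like the following sketch (the class and method names are hypothetical; only the {{-1 - num_collected_docs}} formula comes from the TODO above):

```java
// Hypothetical sketch of the proposed totalHits encoding; not actual Lucene code.
class HitCountEncoding {

  // Exact counts stay >= 0; a lower bound is folded into a negative value,
  // so -1 still means "no information" (0 collected docs).
  static int encodeLowerBound(int numCollectedDocs) {
    return -1 - numCollectedDocs;
  }

  // Recover the lower bound on the number of matches from the encoded value.
  static int decodeLowerBound(int encodedTotalHits) {
    return -1 - encodedTotalHits;
  }
}
```

A caller seeing a negative total hit count would then know the query matched at least {{decodeLowerBound(totalHits)}} documents.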
Follow-ups:
- Require that scores are always positive or zero so that such optimizations
are easier to reason about and generalize in the future?
- Call the {{setMinCompetitiveScore}} API from {{TopFieldDocCollector}} too
when the first sort field is the score.
- Propagate {{Scorer.setMinCompetitiveScore(float)}} whenever possible so that
e.g. filtered disjunctions can benefit from the optimization too.
- Implement {{Scorer.setMinCompetitiveScore(float)}} on other scorers.
- Store more metadata in order to be able to compute tighter upper bounds for
the scores, like the maximum term frequency. Maybe even allow
similarity-specific stats?
> Maxscore - Efficient Scoring
> ----------------------------
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs, core/query/scoring, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Stefan Pohl
> Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood,
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with
> example queries and lucenebench, Mike McCandless's benchmarking package,
> resulting in very significant speedups.
> This ticket is to start the discussion on including the implementation in
> Lucene's codebase. Because the technique requires awareness of it from the
> Lucene user/developer, it seems best to make it a contrib/module package so
> that it can consciously be chosen to be used.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]