[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4100:
---------------------------------
    Attachment: LUCENE-4100.patch

I have been exploring how we could make this optimization a bit less 
inconvenient by not requiring that indices be static or codec-level changes, 
even though this might make the optimization a bit less efficient. Here is a 
patch that demonstrates the idea:
 - {{SimScorer.maxScore()}} returns the maximum score that may be produced by 
SimScorer.score(). The BM25 impl returns {{IDF * (k1 + 1)}}. POSITIVE_INFINITY 
is a fine return value in case a SimScorer doesn't have an upper bound on the 
produced scores.
 - {{Scorer.maxScore()}} returns the maximum score that may be produced by 
Scorer.score(). For TermScorer, this method delegates to SimScorer.maxScore().
 - {{Scorer.setMinCompetitiveScore(float)}} is used to tell the scorer that 
documents that produce scores that are less than the given score _may_ safely 
be ignored. This method is called by TopScoreDocCollector whenever the top of 
the priority queue is updated. Scorer impls are free to ignore it.
 - There is a new {{MaxScoreScorer}}, which only works on pure disjunctions and 
is largely inspired from {{MinShouldMatchSumScorer}} except that instead of 
ensuring that freq >= minShouldMatch, it tries to ensure that sum_of_max_scores 
>= minCompetitiveScore.

It is largely untested, but luceneutil finds the same top docs with and without 
the patch so I expect it to not be completely broken (I had to disable the 
check that hit counts are the same since Maxscore doesn't give total hit 
counts). It yields interesting speedups on wikimedium10M, even for 
{{OrHighHigh}}:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
   HighTermDayOfYearSort      102.75      (4.8%)       57.26      (3.7%)  
-44.3% ( -50% -  -37%)
                 LowTerm      708.14      (5.0%)      696.51      (5.4%)   
-1.6% ( -11% -    9%)
              HighPhrase       68.06      (4.0%)       67.00      (3.7%)   
-1.6% (  -8% -    6%)
            HighSpanNear        2.84      (3.4%)        2.80      (3.6%)   
-1.3% (  -7% -    5%)
             MedSpanNear       24.12      (2.5%)       23.90      (2.6%)   
-0.9% (  -5% -    4%)
               MedPhrase       49.14      (2.8%)       48.81      (2.8%)   
-0.7% (  -6% -    5%)
             LowSpanNear       20.48      (2.8%)       20.36      (2.9%)   
-0.6% (  -6% -    5%)
               LowPhrase       24.11      (2.6%)       24.01      (2.5%)   
-0.4% (  -5% -    4%)
              AndHighMed      316.86      (2.7%)      315.66      (2.7%)   
-0.4% (  -5% -    5%)
                 MedTerm      334.25      (3.1%)      333.01      (3.2%)   
-0.4% (  -6% -    6%)
                 Respell      261.62      (3.4%)      260.86      (5.5%)   
-0.3% (  -8% -    8%)
             AndHighHigh       62.76      (2.2%)       62.59      (2.4%)   
-0.3% (  -4% -    4%)
                HighTerm       86.17      (3.9%)       86.24      (2.8%)    
0.1% (  -6% -    6%)
                  IntNRQ       15.03      (6.5%)       15.20      (5.2%)    
1.1% (  -9% -   13%)
         LowSloppyPhrase       99.72      (3.3%)      100.83      (3.3%)    
1.1% (  -5% -    8%)
                 Prefix3       34.06      (7.2%)       34.48      (5.0%)    
1.3% ( -10% -   14%)
                Wildcard      159.32      (8.4%)      161.37      (6.8%)    
1.3% ( -12% -   18%)
         MedSloppyPhrase       42.17      (3.2%)       42.81      (3.0%)    
1.5% (  -4% -    7%)
              AndHighLow     1059.03      (3.6%)     1075.73      (2.9%)    
1.6% (  -4% -    8%)
            OrNotHighLow     1774.47      (4.1%)     1805.36      (4.7%)    
1.7% (  -6% -   10%)
        HighSloppyPhrase       32.84      (4.0%)       33.46      (3.4%)    
1.9% (  -5% -    9%)
            OrNotHighMed      295.65      (6.4%)      309.03      (4.7%)    
4.5% (  -6% -   16%)
       HighTermMonthSort      227.22      (9.9%)      241.42      (8.7%)    
6.2% ( -11% -   27%)
           OrNotHighHigh       43.19      (3.2%)       48.56      (3.1%)   
12.4% (   5% -   19%)
            OrHighNotMed       93.61      (3.5%)      113.32      (4.8%)   
21.1% (  12% -   30%)
           OrHighNotHigh       13.58      (3.2%)       16.44      (4.9%)   
21.1% (  12% -   30%)
                  Fuzzy1      190.81      (5.2%)      236.11     (14.4%)   
23.7% (   3% -   45%)
            OrHighNotLow       68.12      (3.9%)       85.99      (6.1%)   
26.2% (  15% -   37%)
              OrHighHigh       37.43      (5.4%)       51.96      (5.0%)   
38.8% (  26% -   52%)
               OrHighMed       55.45      (5.2%)      130.78      (9.0%)  
135.8% ( 115% -  158%)
                  Fuzzy2       24.33      (4.6%)       81.51     (24.2%)  
235.0% ( 197% -  276%)
               OrHighLow       32.37      (4.4%)      451.34     (47.8%) 
1294.2% (1189% - 1408%)
{noformat}

TODO:
 - docs, tests
 - Add a {{boolean needsTotalHits}} to {{Query.createWeight}} and {{Collector}} 
similarly to the existing {{boolean needsScores}}.
 - How should we deal with TopDocs instances that don't have correct total hit 
counts? Set it to {{-1}}? Or maybe {{-1 - num_collected_docs}} so that users 
can get a lower bound of the number of matches?

Follow-ups:
 - Require that scores are always positive or null so that such optimizations 
are easier to reason about and generalize in the future?
 - Call the {{setMinCompetitiveScore}} API from {{TopFieldDocCollector}} too 
when the first sort field is the score.
 - Propagate {{Scorer.setMinCompetitiveScore(float)}} whenever possible so that 
eg. filtered disjunctions could benefit from the optimization too.
 - Implement {{Scorer.setMinCompetitiveScore(float)}} on other scorers.
 - Store more metadata in order to be able to compute better higher bounds for 
the scores, like the maximum term frequency. Maybe even allow 
similarity-specific stats?

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to