[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

Adrien Grand (JIRA) Fri, 13 Oct 2017 09:45:21 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203829#comment-16203829
 ]


Adrien Grand commented on LUCENE-4100:
--------------------------------------

bq. indexsearcher already knows if scores are needed (e.g. from the Sort), but 
there is no way to tell it that approximate total hit count is acceptable. If 
we can do that, then I think we can make the early termination case really easy 
for the sorted case, index order case, and also this maxscore case.

+1 We need to add a new {{boolean needsTotalHits}} to the {{search(Query, 
int)}} and {{search(Query, int, Sort)}} methods.

bq. we do? [...] I'm just asking because the new scorer here looks a hell of a 
lot like a disjunction scorer

Well it is a disjunction. :) Our regular disjunction scorer maintains a single 
heap and only looks at the 'top' element and callso {{updateTop}} after calls 
to {{nextDoc}} or {{advance}}. If you want to be able to call {{advance()}} 
sometimes on low-scoring clauses even if only {{nextDoc()}} was called on the 
disjunction, you need to give it the ability to leave some scorers behind, as 
long as the sum of the max scores of scorers that are behind and scorers that 
are positioned on the current candidate is not greater than the minimum 
competitive score (otherwise you might be missing matches). This means you need 
to move scorers between at least 2 data-structures. In practice, we actually 
use 3 of them so that we can also easily differenciate scorers that are on the 
current candidate from scorers that are too advanced. Moving scorers between 
those data-structures has some overhead. For instance here is what I get when I 
benchmark the MaxScoreScorer against master's DisjunctionSumScorer (BS1 is 
disabled in both cases):

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
   HighTermDayOfYearSort       33.34      (7.3%)       16.55      (2.6%)  
-50.4% ( -56% -  -43%)
              OrHighHigh       21.91      (4.1%)       12.36      (1.7%)  
-43.6% ( -47% -  -39%)
               OrHighLow       48.08      (4.0%)       29.27      (2.4%)  
-39.1% ( -43% -  -34%)
               OrHighMed       60.66      (4.0%)       39.32      (2.5%)  
-35.2% ( -40% -  -29%)
                  Fuzzy2      117.30      (7.1%)      101.36      (7.5%)  
-13.6% ( -26% -    1%)
                  Fuzzy1      238.83     (10.7%)      212.80      (7.3%)  
-10.9% ( -26% -    7%)
            OrHighNotLow       70.64      (3.4%)       66.82      (2.0%)   
-5.4% ( -10% -    0%)
            OrHighNotMed       65.59      (3.0%)       62.63      (1.6%)   
-4.5% (  -8% -    0%)
           OrNotHighHigh       34.55      (2.3%)       33.24      (1.1%)   
-3.8% (  -6% -    0%)
           OrHighNotHigh       48.98      (2.4%)       47.17      (1.4%)   
-3.7% (  -7% -    0%)
            OrNotHighMed      107.24      (2.1%)      103.40      (1.1%)   
-3.6% (  -6% -    0%)
                  IntNRQ       26.26     (11.5%)       25.64     (12.3%)   
-2.3% ( -23% -   24%)
              AndHighLow      879.33      (3.7%)      860.16      (3.3%)   
-2.2% (  -8% -    4%)
              AndHighMed      168.90      (1.7%)      165.48      (1.3%)   
-2.0% (  -4% -    1%)
              HighPhrase       20.08      (2.8%)       19.69      (2.7%)   
-2.0% (  -7% -    3%)
         MedSloppyPhrase       15.44      (1.7%)       15.15      (1.8%)   
-1.9% (  -5% -    1%)
             LowSpanNear       44.70      (2.1%)       43.88      (1.9%)   
-1.8% (  -5% -    2%)
               MedPhrase       52.07      (3.2%)       51.16      (3.1%)   
-1.7% (  -7% -    4%)
         LowSloppyPhrase      150.90      (1.5%)      148.62      (1.5%)   
-1.5% (  -4% -    1%)
            OrNotHighLow     1174.47      (3.6%)     1157.52      (4.1%)   
-1.4% (  -8% -    6%)
            HighSpanNear       38.46      (3.0%)       37.92      (2.4%)   
-1.4% (  -6% -    4%)
        HighSloppyPhrase       28.13      (2.4%)       27.76      (2.5%)   
-1.3% (  -6% -    3%)
               LowPhrase      131.67      (1.6%)      130.37      (2.0%)   
-1.0% (  -4% -    2%)
             MedSpanNear       54.75      (3.4%)       54.22      (3.0%)   
-1.0% (  -7% -    5%)
                Wildcard       41.58      (5.1%)       41.23      (5.2%)   
-0.8% ( -10% -    9%)
             AndHighHigh       71.08      (1.5%)       70.61      (1.3%)   
-0.7% (  -3% -    2%)
                 Prefix3       88.11      (6.4%)       87.57      (7.5%)   
-0.6% ( -13% -   14%)
                 MedTerm      235.73      (1.4%)      235.35      (3.7%)   
-0.2% (  -5% -    4%)
                HighTerm      144.06      (1.3%)      143.93      (4.2%)   
-0.1% (  -5% -    5%)
                 LowTerm      711.49      (4.2%)      718.15      (4.7%)    
0.9% (  -7% -   10%)
                 Respell      203.69      (8.6%)      206.72      (7.0%)    
1.5% ( -12% -   18%)
       HighTermMonthSort      196.30      (5.2%)      222.54      (8.2%)   
13.4% (   0% -   28%)
{noformat}

bq.  I do like that in your patch AssertingScorer checks all that stuff, but 
there it is a bit confusing.

Can you elaborate on what you find confusing? This looks similar to how you 
should not call {{Score.score()}} if you passed {{needsScores=false}} to me?

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

Reply via email to