[jira] [Comment Edited] (LUCENE-4198) Allow codecs to index term impacts

Adrien Grand (JIRA) Fri, 12 Jan 2018 11:17:18 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324415#comment-16324415
 ]


Adrien Grand edited comment on LUCENE-4198 at 1/12/18 7:16 PM:
---------------------------------------------------------------

To give some insight into future work on scorers, here is an untested patch 
(the only check so far is that luceneutil returns the same hits) that 
implements some ideas from the BMW (Block-Max WAND) paper.

The new {{BlockMaxConjunctionScorer}} skips blocks whose sum of max scores is 
less than the minimum competitive score, and also skips hits when the score of 
the max scoring clause is less than the minimum competitive score minus the sum 
of the max scores of the other clauses.
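
As a rough illustration of the two skipping conditions described above, here 
is a minimal Python sketch. It is not the code from the patch: the function 
names, the parameter names and the plain lists of floats are all assumptions 
made for the example, and the threshold is taken to be the lowest score a hit 
needs in order to still be collected.

```python
from typing import Sequence


def block_can_compete(block_max_scores: Sequence[float],
                      min_competitive_score: float) -> bool:
    """Block-level check (hypothetical helper, not a Lucene API).

    block_max_scores holds, for each clause of the conjunction, the maximum
    score that clause can produce inside the current block. If even the sum
    of those maxima is below the threshold, no document in the block can be
    competitive and the whole block can be skipped.
    """
    return sum(block_max_scores) >= min_competitive_score


def hit_can_compete(lead_score: float,
                    other_max_scores: Sequence[float],
                    min_competitive_score: float) -> bool:
    """Hit-level check (hypothetical helper, not a Lucene API).

    lead_score is the actual score of the max scoring clause on the current
    document. The other clauses can add at most sum(other_max_scores), so the
    hit is skipped when lead_score is less than the threshold minus that sum.
    """
    return lead_score >= min_competitive_score - sum(other_max_scores)


if __name__ == "__main__":
    # Two clauses whose block maxima add up to 2.0: not enough to reach 2.5.
    print(block_can_compete([1.2, 0.8], 2.5))
    # The lead clause scores 1.0 and the other clause adds at most 0.8.
    print(hit_can_compete(1.0, [0.8], 2.0))
```

The second check is cheaper than scoring every clause, since only the score of 
one clause has to be computed before deciding whether to move on.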

{{WANDScorer}} uses the block max scores to get an upper bound of the score of 
the current candidate, which already helps {{OrHighLow}}. It could also skip 
over blocks when the sum of the max scores is not competitive, but the 
implementation needs a bit more work than for conjunctions.
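
The upper-bound idea for disjunctions can be sketched in the same way. Again 
this is only an illustration under assumptions, not the patch itself: the 
{{Clause}} class, its fields and the helper names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Clause:
    """Hypothetical stand-in for one clause of a disjunction."""
    doc: int                # document the clause is currently positioned on
    block_max_score: float  # max score of this clause in its current block


def upper_bound(clauses: Iterable[Clause], candidate_doc: int) -> float:
    """Upper bound of the candidate's score.

    Only clauses positioned on the candidate can contribute, and each one
    contributes at most its block max score.
    """
    return sum(c.block_max_score for c in clauses if c.doc == candidate_doc)


def worth_scoring(clauses: Iterable[Clause], candidate_doc: int,
                  min_competitive_score: float) -> bool:
    """Return False when the candidate cannot reach the threshold, so the
    exact score does not need to be computed."""
    return upper_bound(clauses, candidate_doc) >= min_competitive_score


if __name__ == "__main__":
    clauses = [Clause(7, 1.5), Clause(7, 0.5), Clause(9, 3.0)]
    # Only the two clauses on doc 7 count, so the bound is 1.5 + 0.5.
    print(upper_bound(clauses, 7))
    print(worth_scoring(clauses, 7, 2.5))
```

This sketch only avoids scoring a single candidate. Skipping whole blocks, as 
mentioned above, would additionally require knowing where each block ends.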

In the luceneutil results below, the baseline is LUCENE-4198.patch and the 
patch is LUCENE-4198.patch and LUCENE-4198-BMW.patch combined.

{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                 LowTerm      2365.07      (2.8%)     2313.92      (2.5%)   -2.2% (  -7% -    3%)
               OrHighMed        73.78      (2.9%)       72.70      (2.5%)   -1.5% (  -6% -    4%)
   HighTermDayOfYearSort        88.44     (11.4%)       87.15     (13.0%)   -1.5% ( -23% -   25%)
                HighTerm       650.28      (5.8%)      646.81      (5.7%)   -0.5% ( -11% -   11%)
                 Respell       228.08      (2.5%)      227.84      (2.4%)   -0.1% (  -4% -    4%)
                 MedTerm      1189.63      (4.2%)     1189.27      (4.6%)   -0.0% (  -8% -    9%)
             MedSpanNear        12.21      (5.0%)       12.24      (5.5%)    0.2% (  -9% -   11%)
            HighSpanNear         7.26      (5.5%)        7.28      (5.8%)    0.2% ( -10% -   12%)
                Wildcard       108.43      (7.0%)      108.95      (6.8%)    0.5% ( -12% -   15%)
                 Prefix3       128.80      (8.1%)      129.46      (7.8%)    0.5% ( -14% -   17%)
       HighTermMonthSort       172.27      (8.0%)      173.28      (8.0%)    0.6% ( -14% -   18%)
                  Fuzzy2       104.86      (5.7%)      105.79      (6.5%)    0.9% ( -10% -   13%)
         LowSloppyPhrase        14.80      (5.6%)       14.93      (6.1%)    0.9% ( -10% -   13%)
             LowSpanNear        95.06      (3.4%)       96.07      (4.2%)    1.1% (  -6% -    8%)
        HighSloppyPhrase         3.96      (8.6%)        4.02      (9.7%)    1.6% ( -15% -   21%)
                  IntNRQ        29.80      (7.0%)       30.50      (6.9%)    2.4% ( -10% -   17%)
                  Fuzzy1       281.25      (4.8%)      288.77      (9.5%)    2.7% ( -11% -   17%)
         MedSloppyPhrase        53.95      (8.0%)       55.43      (9.0%)    2.7% ( -13% -   21%)
              OrHighHigh        23.86      (4.1%)       24.70      (2.7%)    3.5% (  -3% -   10%)
               MedPhrase        42.45      (2.2%)       44.10      (3.2%)    3.9% (  -1% -    9%)
               LowPhrase        19.57      (2.7%)       20.47      (3.6%)    4.6% (  -1% -   11%)
              HighPhrase        15.76      (4.1%)       16.91      (5.3%)    7.3% (  -1% -   17%)
               OrHighLow       209.91      (2.3%)      261.10      (3.5%)   24.4% (  18% -   30%)
             AndHighHigh        27.22      (2.1%)       47.66      (5.1%)   75.1% (  66% -   84%)
              AndHighLow       514.84      (3.5%)      920.46      (6.0%)   78.8% (  66% -   91%)
              AndHighMed        56.15      (2.0%)      107.60      (5.4%)   91.6% (  82% -  101%)
{noformat}






> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch, 
> LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.



