[jira] [Updated] (LUCENE-6201) MinShouldMatchSumScorer should advance less and score lazily

Adrien Grand (JIRA) Thu, 29 Jan 2015 10:53:01 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-6201:
---------------------------------
    Attachment: LUCENE-6201.patch

Here is a new patch, with a summary of what it does:
 - improve MinShouldMatchSumScorer to call nextDoc/advance less often on sub 
scorers
 - improve MinShouldMatchSumScorer to only score on demand
 - make BooleanScorer able to deal with minShouldMatch > 1. It only scores 
windows of 2048 documents in which at least minShouldMatch clauses have at 
least one match.
 - use BooleanScorer for minShouldMatch > 1 when the cost is high (> 
maxDoc / 3)
 - DisjunctionScorer and MinShouldMatchSumScorer both had a priority queue 
ordered by doc ID, so I factored it out into a separate class (but without 
using oal.util.PriorityQueue, since the pluggable comparison function seems to 
hurt performance a lot)
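
To make the window filtering and the cost heuristic from the list above more concrete, here is a minimal standalone sketch. It is not the code from the patch: the class name, the method names and the representation of clauses as sorted int arrays are all hypothetical, and the cost check is only one plausible reading of "cost > maxDoc / 3". Passing the window-level check is a necessary condition only, so documents inside a selected window would still have to be checked individually.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Lucene's BooleanScorer. */
final class WindowedMinShouldMatchSketch {

  /** Window size mentioned in the message. */
  static final int WINDOW_SIZE = 2048;

  /** Assumed form of the heuristic: prefer the windowed scorer when cost > maxDoc / 3. */
  static boolean preferWindowedScorer(long cost, int maxDoc) {
    return cost > maxDoc / 3;
  }

  /**
   * Returns the first doc ID of every window in which at least minShouldMatch
   * clauses have at least one match. Each clause is modeled as a sorted array
   * of matching doc IDs (an assumption made for this sketch).
   */
  static List<Integer> windowsToScore(int[][] clauses, int minShouldMatch, int maxDoc) {
    List<Integer> result = new ArrayList<>();
    int[] pos = new int[clauses.length]; // one cursor per clause, only moves forward
    for (long base = 0; base < maxDoc; base += WINDOW_SIZE) {
      long end = Math.min(base + WINDOW_SIZE, (long) maxDoc); // exclusive upper bound
      int clausesWithMatch = 0;
      for (int c = 0; c < clauses.length; c++) {
        int[] docs = clauses[c];
        // Skip doc IDs that fall before the current window.
        while (pos[c] < docs.length && docs[pos[c]] < base) {
          pos[c]++;
        }
        // The clause counts if its next doc ID lies inside the window.
        if (pos[c] < docs.length && docs[pos[c]] < end) {
          clausesWithMatch++;
        }
      }
      if (clausesWithMatch >= minShouldMatch) {
        result.add((int) base);
      }
    }
    return result;
  }
}
```

For example, with clauses {1, 5000}, {2, 3000} and {4100}, minShouldMatch = 2 and maxDoc = 6144, only the windows starting at 0 and 4096 would be selected; the middle window contains a match from a single clause and can be skipped without scoring anything.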


Here are results of the various benchmarks that we have:

wikimedium10M:
{noformat}
                Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy2       58.35      (4.4%)       55.53      (6.4%)   -4.8% ( -14% -    6%)
               OrHighMed       59.68      (4.0%)       59.11      (4.2%)   -1.0% (  -8% -    7%)
               OrHighLow       68.54      (4.0%)       67.91      (4.3%)   -0.9% (  -8% -    7%)
              OrHighHigh       21.65      (4.6%)       21.48      (4.8%)   -0.8% (  -9% -    9%)
                  Fuzzy1       80.72      (3.8%)       80.40      (3.6%)   -0.4% (  -7% -    7%)
                HighTerm      105.98      (2.5%)      105.59      (2.8%)   -0.4% (  -5% -    5%)
                Wildcard       49.40      (5.3%)       49.24      (5.3%)   -0.3% ( -10% -   10%)
                 MedTerm      220.04      (2.6%)      219.36      (2.8%)   -0.3% (  -5% -    5%)
            OrHighNotLow       92.87      (2.8%)       92.61      (2.8%)   -0.3% (  -5% -    5%)
            OrHighNotMed       67.10      (2.3%)       66.92      (2.2%)   -0.3% (  -4% -    4%)
                 Prefix3      100.43      (5.4%)      100.19      (5.3%)   -0.2% ( -10% -   11%)
                PKLookup      272.56      (3.2%)      271.93      (3.3%)   -0.2% (  -6% -    6%)
            OrNotHighMed      171.29      (2.1%)      170.92      (2.1%)   -0.2% (  -4% -    4%)
                 LowTerm      822.27      (5.0%)      821.03      (4.8%)   -0.2% (  -9% -   10%)
               MedPhrase      137.90      (2.5%)      137.77      (2.2%)   -0.1% (  -4% -    4%)
           OrNotHighHigh       42.48      (1.4%)       42.44      (1.3%)   -0.1% (  -2% -    2%)
             AndHighHigh       52.45      (1.3%)       52.43      (1.3%)   -0.0% (  -2% -    2%)
              AndHighMed      210.77      (2.1%)      210.83      (2.4%)    0.0% (  -4% -    4%)
               LowPhrase       50.86      (3.8%)       50.90      (3.7%)    0.1% (  -7% -    7%)
             MedSpanNear       19.83      (5.6%)       19.84      (5.6%)    0.1% ( -10% -   11%)
              HighPhrase       18.50      (3.3%)       18.52      (3.4%)    0.1% (  -6% -    7%)
         LowSloppyPhrase       41.13      (2.6%)       41.17      (2.2%)    0.1% (  -4% -    5%)
         MedSloppyPhrase       60.91      (3.0%)       60.98      (2.4%)    0.1% (  -5% -    5%)
           OrHighNotHigh       40.01      (1.2%)       40.06      (1.3%)    0.1% (  -2% -    2%)
                  IntNRQ       16.17      (6.2%)       16.19      (6.3%)    0.1% ( -11% -   13%)
            OrNotHighLow      611.94      (3.1%)      612.86      (3.3%)    0.1% (  -6% -    6%)
            HighSpanNear        3.47      (5.8%)        3.48      (6.1%)    0.2% ( -11% -   12%)
        HighSloppyPhrase       24.65      (3.8%)       24.71      (3.2%)    0.3% (  -6% -    7%)
             LowSpanNear       90.25      (2.8%)       90.61      (3.1%)    0.4% (  -5% -    6%)
                 Respell       72.37      (3.9%)       72.95      (3.7%)    0.8% (  -6% -    8%)
              AndHighLow      821.42      (4.4%)      830.72      (4.0%)    1.1% (  -6% -    9%)
{noformat}

wikimedium10M with Lucene (both the baseline and the patched version) modified 
to always use BS2 (to validate that sharing the priority queue between 
DisjunctionScorer and MinShouldMatchSumScorer does not hurt):
{noformat}
                Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy1       72.12      (5.7%)       71.59      (7.2%)   -0.7% ( -12% -   12%)
                 LowTerm      860.24      (5.5%)      855.31      (5.1%)   -0.6% ( -10% -   10%)
                PKLookup      267.21      (2.6%)      266.07      (2.7%)   -0.4% (  -5% -    4%)
            OrNotHighMed      244.11      (1.6%)      243.11      (2.0%)   -0.4% (  -3% -    3%)
               MedPhrase      226.75      (2.3%)      226.07      (2.3%)   -0.3% (  -4% -    4%)
           OrHighNotHigh       47.84      (1.6%)       47.71      (1.8%)   -0.3% (  -3% -    3%)
        HighSloppyPhrase       24.65      (2.4%)       24.59      (2.4%)   -0.2% (  -4% -    4%)
            OrHighNotMed       53.28      (2.9%)       53.15      (3.1%)   -0.2% (  -6% -    5%)
                HighTerm       54.46      (2.3%)       54.36      (2.3%)   -0.2% (  -4% -    4%)
              HighPhrase       46.80      (2.6%)       46.73      (2.7%)   -0.2% (  -5% -    5%)
               LowPhrase      224.46      (3.1%)      224.16      (3.6%)   -0.1% (  -6% -    6%)
            OrHighNotLow       75.21      (3.3%)       75.11      (3.2%)   -0.1% (  -6% -    6%)
           OrNotHighHigh       55.94      (1.5%)       55.91      (1.8%)   -0.1% (  -3% -    3%)
            OrNotHighLow      932.36      (3.4%)      932.28      (3.2%)   -0.0% (  -6% -    6%)
                 MedTerm      306.76      (2.3%)      306.77      (2.3%)    0.0% (  -4% -    4%)
             AndHighHigh       59.85      (0.9%)       59.87      (0.9%)    0.0% (  -1% -    1%)
                 Prefix3       29.55      (2.4%)       29.56      (2.5%)    0.0% (  -4% -    5%)
         LowSloppyPhrase       35.97      (3.0%)       36.01      (2.7%)    0.1% (  -5% -    5%)
             LowSpanNear      219.56      (3.0%)      219.86      (3.2%)    0.1% (  -5% -    6%)
               OrHighLow       17.18      (3.2%)       17.21      (4.2%)    0.2% (  -7% -    7%)
                  IntNRQ        8.77      (2.8%)        8.79      (3.0%)    0.2% (  -5% -    6%)
              AndHighMed      184.88      (1.5%)      185.24      (1.5%)    0.2% (  -2% -    3%)
                Wildcard       40.48      (2.5%)       40.56      (2.6%)    0.2% (  -4% -    5%)
         MedSloppyPhrase       35.40      (2.4%)       35.47      (2.1%)    0.2% (  -4% -    4%)
            HighSpanNear        7.32      (4.8%)        7.35      (5.1%)    0.5% (  -8% -   10%)
                 Respell       58.99      (4.0%)       59.35      (2.7%)    0.6% (  -5% -    7%)
              AndHighLow      921.16      (6.8%)      927.79      (4.1%)    0.7% (  -9% -   12%)
             MedSpanNear       10.33      (3.3%)       10.41      (3.2%)    0.7% (  -5% -    7%)
                  Fuzzy2       64.92     (11.8%)       65.56     (10.1%)    1.0% ( -18% -   25%)
               OrHighMed       38.65      (2.8%)       39.25      (4.4%)    1.6% (  -5% -    9%)
              OrHighHigh       20.06      (2.7%)       20.64      (4.7%)    2.9% (  -4% -   10%)
{noformat}

MinShouldMatch tasks on wikimedium1M:
{noformat}
     Low3MinShouldMatch4     1204.57      (7.1%)      944.59      (2.7%)  -21.6% ( -29% -  -12%)
     Low4MinShouldMatch4     1294.39     (10.7%)     1026.20      (3.8%)  -20.7% ( -31% -   -6%)
     Low4MinShouldMatch3      993.94      (7.7%)      827.22      (5.4%)  -16.8% ( -27% -   -3%)
     Low4MinShouldMatch2      314.76      (4.9%)      272.76      (3.4%)  -13.3% ( -20% -   -5%)
     Low2MinShouldMatch4      345.04      (8.0%)      303.89      (6.4%)  -11.9% ( -24% -    2%)
     Low3MinShouldMatch3      304.41      (5.0%)      271.79      (3.3%)  -10.7% ( -18% -   -2%)
     Low1MinShouldMatch4       39.30      (8.8%)       39.70      (2.7%)    1.0% (  -9% -   13%)
     Low4MinShouldMatch0       71.65      (3.2%)       73.38      (6.5%)    2.4% (  -7% -   12%)
     Low3MinShouldMatch0       47.45      (2.2%)       49.00      (7.2%)    3.3% (  -5% -   12%)
     Low2MinShouldMatch3       37.50      (9.3%)       38.83      (8.4%)    3.5% ( -12% -   23%)
     HighMinShouldMatch0       25.55      (1.8%)       26.52      (8.1%)    3.8% (  -6% -   13%)
     Low2MinShouldMatch0       35.24      (1.8%)       36.61      (7.3%)    3.9% (  -5% -   13%)
                PKLookup      316.26      (2.5%)      328.56      (3.4%)    3.9% (  -1% -   10%)
     Low1MinShouldMatch0       29.63      (2.1%)       30.89      (7.6%)    4.2% (  -5% -   14%)
     Low3MinShouldMatch2       38.91      (9.6%)       47.22      (8.4%)   21.3% (   3% -   43%)
     HighMinShouldMatch4       21.97      (9.4%)       27.15     (10.2%)   23.5% (   3% -   47%)
     Low1MinShouldMatch3       21.84      (9.5%)       30.69     (10.9%)   40.5% (  18% -   67%)
     Low2MinShouldMatch2       22.69      (9.7%)       35.21     (11.0%)   55.2% (  31% -   83%)
     HighMinShouldMatch3       16.01      (9.7%)       26.07     (12.3%)   62.9% (  37% -   94%)
     Low1MinShouldMatch2       16.86      (9.3%)       29.94     (13.1%)   77.6% (  50% -  110%)
     HighMinShouldMatch2       13.64      (8.6%)       25.99     (14.6%)   90.6% (  62% -  124%)
{noformat}

On this last benchmark, we can see that using BooleanScorer speeds up slow 
queries. Some queries are slower, but it looks like that is just because 
MinShouldMatchSumScorer gets picked less often.

I think it's ready?

> MinShouldMatchSumScorer should advance less and score lazily
> ------------------------------------------------------------
>
>                 Key: LUCENE-6201
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6201
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk, 5.1
>
>         Attachments: LUCENE-6201.patch, LUCENE-6201.patch, LUCENE-6201.patch
>
>
> MinShouldMatchSumScorer currently computes the score eagerly, even on 
> documents that end up not matching because it cannot find 
> {{minShouldMatch}} matches on them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

