[ 
https://issues.apache.org/jira/browse/LUCENE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-6201:
---------------------------------
    Attachment: LUCENE-6201.patch

I have been working on an alternative implementation that tries to generalize 
how our disjunction and conjunction scorers are working. It keeps track of 
scorers in 3 different data-structures:
 - a linked-list of scorers that are positioned on the next potential match, 
called 'lead' since they are used to lead the iteration
 - a heap of scorers that are beyond the next potential match called 'head', 
ordered by doc ID (like DisjunctionScorer)
 - a heap of scorers that are behind the next potential match called 'tail', 
ordered by cost (like ConjunctionScorer, although the ConjunctionScorer case is 
simpler since the set of scorers does not change it can just use a sorted 
array). This heap has a size of at most {{minShouldMatch - 1}}, which 
guarantees that they can't be a match on their own (since we need at least 
{{minShouldMatch}} matching clauses).

When you want to move to the next document, you first move scorers from 'lead' 
to 'tail'. And if it overflows, you just have to advance the least-costly 
scorers to 'head'. Then the next potential match is the doc ID at the top of 
'head' and we need to pop 'head' from all scorers which are on this doc ID. 
Finally, we just have to advance the least-costly scorer from 'tail' until 
there is a match.

I ran benchmarks with the current implementation using the tasks file from 
LUCENE-4571. Some queries are slower, other queries are faster:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
     Low3MinShouldMatch2       41.11      (7.3%)       34.29      (3.6%)  
-16.6% ( -25% -   -6%)
     Low2MinShouldMatch2       24.18      (7.4%)       21.28      (2.8%)  
-12.0% ( -20% -   -1%)
     Low1MinShouldMatch2       17.92      (7.1%)       16.26      (3.1%)   
-9.3% ( -18% -    0%)
     HighMinShouldMatch4       23.14      (6.3%)       21.13      (3.4%)   
-8.7% ( -17% -    1%)
     HighMinShouldMatch3       17.01      (6.9%)       15.73      (2.9%)   
-7.5% ( -16% -    2%)
     Low1MinShouldMatch3       23.20      (6.9%)       21.49      (3.1%)   
-7.4% ( -16% -    2%)
     HighMinShouldMatch2       14.48      (6.9%)       13.63      (3.4%)   
-5.8% ( -15% -    4%)
     Low4MinShouldMatch2      327.94      (2.6%)      312.58      (2.4%)   
-4.7% (  -9% -    0%)
     Low2MinShouldMatch3       39.53      (7.1%)       38.77      (3.2%)   
-1.9% ( -11% -    9%)
     Low4MinShouldMatch0       73.34      (3.3%)       72.17      (2.1%)   
-1.6% (  -6% -    3%)
     Low2MinShouldMatch0       36.11      (2.1%)       35.62      (1.5%)   
-1.4% (  -4% -    2%)
     Low1MinShouldMatch4       41.57      (6.1%)       41.01      (3.3%)   
-1.4% ( -10% -    8%)
     Low3MinShouldMatch0       48.49      (2.1%)       47.90      (1.7%)   
-1.2% (  -4% -    2%)
     Low3MinShouldMatch3      311.34      (8.0%)      309.54      (2.4%)   
-0.6% ( -10% -   10%)
     Low1MinShouldMatch0       30.28      (2.0%)       30.14      (1.1%)   
-0.5% (  -3% -    2%)
     HighMinShouldMatch0       26.09      (1.7%)       25.99      (1.1%)   
-0.4% (  -3% -    2%)
                PKLookup      322.05      (3.3%)      323.17      (3.3%)    
0.3% (  -6% -    7%)
     Low2MinShouldMatch4      362.28      (5.7%)      366.96      (3.0%)    
1.3% (  -6% -   10%)
     Low4MinShouldMatch4     1380.17      (6.7%)     1541.42     (11.0%)   
11.7% (  -5% -   31%)
     Low3MinShouldMatch4     1299.86      (6.4%)     1506.99      (4.7%)   
15.9% (   4% -   28%)
     Low4MinShouldMatch3     1060.15      (6.0%)     1233.64      (3.7%)   
16.4% (   6% -   27%)
{noformat}

This implementation is very careful about not advancing more than needed which 
is sometimes not the right trade-off on term queries since they are so fast. I 
tried to measure how many times nextDoc and advance are called for the extreme 
queries from this benchmark: Low3MinShouldMatch2 (-16.6%) and 
Low4MinShouldMatch3 (16.4%).

|| Low3MinShouldMatch2 || trunk || patch || diff ||
| nextDoc | 3317417 | 2385754 | -28% |
| advance | 2565471 | 3196711 | +25% |
| total | 5882888 | 5582465 | -5% |

|| Low4MinShouldMatch3 || trunk || patch ||
| nextDoc | 86588 | 320 | -99% |
| advance | 20644 | 74305 | +260% |
| total | 107232 | 74625 | -30% |

Overall this new impl seems to especially help on queries which have 
low-frequency clauses and high values of minShouldMatch where its logic helps 
save calls to nextDoc/advance. When it does not help save many nextDoc/advance 
calls like in Low3MinShouldMatch2, its constant overhead makes it slower.

The other interesting part is that it scores lazily, so I hacked luceneutil to 
wrap the parsed boolean queries into constant-score queries, and this time the 
difference is even better since the current MinShouldMatchSumScorer always 
computes scores:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
     Low1MinShouldMatch0       28.88      (1.1%)       28.43      (1.4%)   
-1.5% (  -3% -    0%)
     Low2MinShouldMatch0       33.95      (1.2%)       33.51      (1.4%)   
-1.3% (  -3% -    1%)
     HighMinShouldMatch0       24.83      (1.4%)       24.60      (1.5%)   
-0.9% (  -3% -    2%)
     Low3MinShouldMatch0       44.95      (1.4%)       44.63      (1.4%)   
-0.7% (  -3% -    2%)
     Low4MinShouldMatch0       66.11      (1.5%)       65.67      (1.6%)   
-0.7% (  -3% -    2%)
                PKLookup      321.97      (2.7%)      323.22      (2.5%)    
0.4% (  -4% -    5%)
     Low4MinShouldMatch2      324.15      (3.3%)      326.37      (2.0%)    
0.7% (  -4% -    6%)
     Low3MinShouldMatch3      312.13      (4.5%)      319.32      (3.0%)    
2.3% (  -4% -   10%)
     Low2MinShouldMatch4      362.11      (4.0%)      371.42      (3.5%)    
2.6% (  -4% -   10%)
     Low3MinShouldMatch2       40.80      (7.1%)       42.67      (4.3%)    
4.6% (  -6% -   17%)
     Low1MinShouldMatch4       41.61      (7.2%)       43.63      (4.2%)    
4.9% (  -6% -   17%)
     HighMinShouldMatch4       22.82      (6.9%)       24.05      (4.9%)    
5.4% (  -6% -   18%)
     Low2MinShouldMatch3       39.52      (7.5%)       43.15      (4.5%)    
9.2% (  -2% -   22%)
     Low1MinShouldMatch3       22.82      (7.3%)       25.56      (5.5%)   
12.0% (   0% -   26%)
     Low2MinShouldMatch2       23.75      (7.3%)       27.33      (4.9%)   
15.1% (   2% -   29%)
     Low4MinShouldMatch4     1291.50      (7.9%)     1492.81      (4.2%)   
15.6% (   3% -   30%)
     Low4MinShouldMatch3     1013.22      (4.4%)     1193.51      (4.8%)   
17.8% (   8% -   28%)
     HighMinShouldMatch3       16.65      (6.8%)       19.98      (6.1%)   
20.0% (   6% -   35%)
     Low3MinShouldMatch4     1188.07     (12.2%)     1445.02      (5.5%)   
21.6% (   3% -   44%)
     Low1MinShouldMatch2       17.45      (6.6%)       21.64      (6.1%)   
24.1% (  10% -   39%)
     HighMinShouldMatch2       14.02      (6.2%)       18.54      (7.2%)   
32.3% (  17% -   48%)
{noformat}

But actually it does not only score lazily, it also stops advancing the tail as 
soon as there are {{minShouldMatch}}. For instance, I measured how many times 
nextDoc and advance are called depending on whether we are scoring or not with 
the patch.

|| HighMinShouldMatch3 || scoring || non scoring ||
| nextDoc | 4025282 | 4135163 | +3% |
| advance | 8329393 | 6828807 | -18% |
| total | 12354675 | 10963970 | -11% |

> MinShouldMatchSumScorer should advance less and score lazily
> ------------------------------------------------------------
>
>                 Key: LUCENE-6201
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6201
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk, 5.1
>
>         Attachments: LUCENE-6201.patch
>
>
> MinShouldMatchSumScorer currently computes the score eagerly, even on 
> documents that do not eventually match if it cannot find {{minShouldMatch}} 
> matches on the same document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to