[
https://issues.apache.org/jira/browse/LUCENE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6201:
---------------------------------
Attachment: LUCENE-6201.patch
I have been working on an alternative implementation that tries to generalize
how our disjunction and conjunction scorers are working. It keeps track of
scorers in 3 different data-structures:
- a linked-list of scorers that are positioned on the next potential match,
called 'lead' since they are used to lead the iteration
- a heap of scorers that are beyond the next potential match called 'head',
ordered by doc ID (like DisjunctionScorer)
- a heap of scorers that are behind the next potential match called 'tail',
ordered by cost (like ConjunctionScorer, although the ConjunctionScorer case is
simpler since the set of scorers does not change it can just use a sorted
array). This heap has a size of at most {{minShouldMatch - 1}}, which
guarantees that they can't be a match on their own (since we need at least
{{minShouldMatch}} matching clauses).
When you want to move to the next document, you first move scorers from 'lead'
to 'tail'. And if it overflows, you just have to advance the least-costly
scorers to 'head'. Then the next potential match is the doc ID at the top of
'head' and we need to pop 'head' from all scorers which are on this doc ID.
Finally, we just have to advance the least-costly scorer from 'tail' until
there is a match.
I ran benchmarks with the current implementation using the tasks file from
LUCENE-4571. Some queries are slower, other queries are faster:
{noformat}
TaskQPS baseline StdDev QPS patch StdDev
Pct diff
Low3MinShouldMatch2 41.11 (7.3%) 34.29 (3.6%)
-16.6% ( -25% - -6%)
Low2MinShouldMatch2 24.18 (7.4%) 21.28 (2.8%)
-12.0% ( -20% - -1%)
Low1MinShouldMatch2 17.92 (7.1%) 16.26 (3.1%)
-9.3% ( -18% - 0%)
HighMinShouldMatch4 23.14 (6.3%) 21.13 (3.4%)
-8.7% ( -17% - 1%)
HighMinShouldMatch3 17.01 (6.9%) 15.73 (2.9%)
-7.5% ( -16% - 2%)
Low1MinShouldMatch3 23.20 (6.9%) 21.49 (3.1%)
-7.4% ( -16% - 2%)
HighMinShouldMatch2 14.48 (6.9%) 13.63 (3.4%)
-5.8% ( -15% - 4%)
Low4MinShouldMatch2 327.94 (2.6%) 312.58 (2.4%)
-4.7% ( -9% - 0%)
Low2MinShouldMatch3 39.53 (7.1%) 38.77 (3.2%)
-1.9% ( -11% - 9%)
Low4MinShouldMatch0 73.34 (3.3%) 72.17 (2.1%)
-1.6% ( -6% - 3%)
Low2MinShouldMatch0 36.11 (2.1%) 35.62 (1.5%)
-1.4% ( -4% - 2%)
Low1MinShouldMatch4 41.57 (6.1%) 41.01 (3.3%)
-1.4% ( -10% - 8%)
Low3MinShouldMatch0 48.49 (2.1%) 47.90 (1.7%)
-1.2% ( -4% - 2%)
Low3MinShouldMatch3 311.34 (8.0%) 309.54 (2.4%)
-0.6% ( -10% - 10%)
Low1MinShouldMatch0 30.28 (2.0%) 30.14 (1.1%)
-0.5% ( -3% - 2%)
HighMinShouldMatch0 26.09 (1.7%) 25.99 (1.1%)
-0.4% ( -3% - 2%)
PKLookup 322.05 (3.3%) 323.17 (3.3%)
0.3% ( -6% - 7%)
Low2MinShouldMatch4 362.28 (5.7%) 366.96 (3.0%)
1.3% ( -6% - 10%)
Low4MinShouldMatch4 1380.17 (6.7%) 1541.42 (11.0%)
11.7% ( -5% - 31%)
Low3MinShouldMatch4 1299.86 (6.4%) 1506.99 (4.7%)
15.9% ( 4% - 28%)
Low4MinShouldMatch3 1060.15 (6.0%) 1233.64 (3.7%)
16.4% ( 6% - 27%)
{noformat}
This implementation is very careful about not advancing more than needed which
is sometimes not the right trade-off on term queries since they are so fast. I
tried to measure how many times nextDoc and advance are called for the extreme
queries from this benchmark: Low3MinShouldMatch2 (-16.6%) and
Low4MinShouldMatch3 (16.4%).
|| Low3MinShouldMatch2 || trunk || patch || diff ||
| nextDoc | 3317417 | 2385754 | -28% |
| advance | 2565471 | 3196711 | +25% |
| total | 5882888 | 5582465 | -5% |
|| Low4MinShouldMatch3 || trunk || patch ||
| nextDoc | 86588 | 320 | -99% |
| advance | 20644 | 74305 | +260% |
| total | 107232 | 74625 | -30% |
Overall this new impl seems to especially help on queries which have
low-frequency clauses and high values of minShouldMatch where its logic helps
save calls to nextDoc/advance. When it does not help save many nextDoc/advance
calls like in Low3MinShouldMatch2, its constant overhead makes it slower.
The other interesting part is that it scores lazily, so I hacked luceneutil to
wrap the parsed boolean queries into constant-score queries, and this time the
difference is even better since the current MinShouldMatchSumScorer always
computes scores:
{noformat}
TaskQPS baseline StdDev QPS patch StdDev
Pct diff
Low1MinShouldMatch0 28.88 (1.1%) 28.43 (1.4%)
-1.5% ( -3% - 0%)
Low2MinShouldMatch0 33.95 (1.2%) 33.51 (1.4%)
-1.3% ( -3% - 1%)
HighMinShouldMatch0 24.83 (1.4%) 24.60 (1.5%)
-0.9% ( -3% - 2%)
Low3MinShouldMatch0 44.95 (1.4%) 44.63 (1.4%)
-0.7% ( -3% - 2%)
Low4MinShouldMatch0 66.11 (1.5%) 65.67 (1.6%)
-0.7% ( -3% - 2%)
PKLookup 321.97 (2.7%) 323.22 (2.5%)
0.4% ( -4% - 5%)
Low4MinShouldMatch2 324.15 (3.3%) 326.37 (2.0%)
0.7% ( -4% - 6%)
Low3MinShouldMatch3 312.13 (4.5%) 319.32 (3.0%)
2.3% ( -4% - 10%)
Low2MinShouldMatch4 362.11 (4.0%) 371.42 (3.5%)
2.6% ( -4% - 10%)
Low3MinShouldMatch2 40.80 (7.1%) 42.67 (4.3%)
4.6% ( -6% - 17%)
Low1MinShouldMatch4 41.61 (7.2%) 43.63 (4.2%)
4.9% ( -6% - 17%)
HighMinShouldMatch4 22.82 (6.9%) 24.05 (4.9%)
5.4% ( -6% - 18%)
Low2MinShouldMatch3 39.52 (7.5%) 43.15 (4.5%)
9.2% ( -2% - 22%)
Low1MinShouldMatch3 22.82 (7.3%) 25.56 (5.5%)
12.0% ( 0% - 26%)
Low2MinShouldMatch2 23.75 (7.3%) 27.33 (4.9%)
15.1% ( 2% - 29%)
Low4MinShouldMatch4 1291.50 (7.9%) 1492.81 (4.2%)
15.6% ( 3% - 30%)
Low4MinShouldMatch3 1013.22 (4.4%) 1193.51 (4.8%)
17.8% ( 8% - 28%)
HighMinShouldMatch3 16.65 (6.8%) 19.98 (6.1%)
20.0% ( 6% - 35%)
Low3MinShouldMatch4 1188.07 (12.2%) 1445.02 (5.5%)
21.6% ( 3% - 44%)
Low1MinShouldMatch2 17.45 (6.6%) 21.64 (6.1%)
24.1% ( 10% - 39%)
HighMinShouldMatch2 14.02 (6.2%) 18.54 (7.2%)
32.3% ( 17% - 48%)
{noformat}
But actually it does not only score lazily, it also stops advancing the tail as
soon as there are {{minShouldMatch}}. For instance, I measured how many times
nextDoc and advance are called depending on whether we are scoring or not with
the patch.
|| HighMinShouldMatch3 || scoring || non scoring ||
| nextDoc | 4025282 | 4135163 | +3% |
| advance | 8329393 | 6828807 | -18% |
| total | 12354675 | 10963970 | -11% |
> MinShouldMatchSumScorer should advance less and score lazily
> ------------------------------------------------------------
>
> Key: LUCENE-6201
> URL: https://issues.apache.org/jira/browse/LUCENE-6201
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: Trunk, 5.1
>
> Attachments: LUCENE-6201.patch
>
>
> MinShouldMatchSumScorer currently computes the score eagerly, even on
> documents that do not eventually match if it cannot find {{minShouldMatch}}
> matches on the same document.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]