[
https://issues.apache.org/jira/browse/LUCENE-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6201:
---------------------------------
Attachment: LUCENE-6201.patch
Here is a new patch and a summary of what it does:
- improve MinShouldMatchSumScorer to call nextDoc/advance less often on sub
scorers
- improve MinShouldMatchSumScorer to only score on demand
- make BooleanScorer able to deal with minShouldMatch > 1. It only scores
windows of 2048 documents in which at least minShouldMatch clauses have at
least one match.
- use BooleanScorer for minShouldMatch > 1 when the cost is high (> maxDoc / 3)
- DisjunctionScorer and MinShouldMatchSumScorer both had a priority queue
ordered by doc ID, so I factored it out into a separate class (not using
oal.util.PriorityQueue, since its pluggable comparison function seems to hurt
performance a lot)
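To make the window-based filtering above more concrete, here is a minimal sketch. It is not the code from the patch: the class WindowedMinShouldMatch, its method names, and the representation of each clause as a sorted int array of doc IDs are all hypothetical, chosen only for illustration. It shows the two ideas from the summary: keeping only the 2048-document windows in which at least minShouldMatch clauses have a match, and the cost > maxDoc / 3 heuristic.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not Lucene's BooleanScorer. */
final class WindowedMinShouldMatch {

    /** Window size mentioned in the summary above. */
    static final int WINDOW_SIZE = 2048;

    /** Heuristic from the summary: use the windowed scorer when the cost is high. */
    static boolean useWindowedScorer(long cost, int maxDoc, int minShouldMatch) {
        return minShouldMatch > 1 && cost > maxDoc / 3;
    }

    /**
     * Assumption: clauses[i] holds the sorted doc IDs matched by clause i.
     * Returns the first doc ID of every window in which at least
     * minShouldMatch distinct clauses have at least one match.
     */
    static int[] candidateWindows(int[][] clauses, int maxDoc, int minShouldMatch) {
        int numWindows = (maxDoc + WINDOW_SIZE - 1) / WINDOW_SIZE;
        int[] clausesPerWindow = new int[numWindows];
        for (int[] docs : clauses) {
            int lastWindow = -1;
            for (int doc : docs) {
                int window = doc / WINDOW_SIZE;
                // Doc IDs are sorted, so each clause is counted once per window.
                if (window != lastWindow) {
                    clausesPerWindow[window]++;
                    lastWindow = window;
                }
            }
        }
        int[] result = new int[numWindows];
        int count = 0;
        for (int window = 0; window < numWindows; window++) {
            if (clausesPerWindow[window] >= minShouldMatch) {
                result[count++] = window * WINDOW_SIZE;
            }
        }
        return Arrays.copyOf(result, count);
    }
}
```

Windows that fail the check can be skipped without scoring any document in them, which is where the saving would come from in this simplified model.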
Here are results of the various benchmarks that we have:
wikimedium10M:
{noformat}
Task                 QPS baseline   StdDev  QPS patch   StdDev  Pct diff
Fuzzy2                      58.35   (4.4%)      55.53   (6.4%)     -4.8% ( -14% - 6%)
OrHighMed                   59.68   (4.0%)      59.11   (4.2%)     -1.0% ( -8% - 7%)
OrHighLow                   68.54   (4.0%)      67.91   (4.3%)     -0.9% ( -8% - 7%)
OrHighHigh                  21.65   (4.6%)      21.48   (4.8%)     -0.8% ( -9% - 9%)
Fuzzy1                      80.72   (3.8%)      80.40   (3.6%)     -0.4% ( -7% - 7%)
HighTerm                   105.98   (2.5%)     105.59   (2.8%)     -0.4% ( -5% - 5%)
Wildcard                    49.40   (5.3%)      49.24   (5.3%)     -0.3% ( -10% - 10%)
MedTerm                    220.04   (2.6%)     219.36   (2.8%)     -0.3% ( -5% - 5%)
OrHighNotLow                92.87   (2.8%)      92.61   (2.8%)     -0.3% ( -5% - 5%)
OrHighNotMed                67.10   (2.3%)      66.92   (2.2%)     -0.3% ( -4% - 4%)
Prefix3                    100.43   (5.4%)     100.19   (5.3%)     -0.2% ( -10% - 11%)
PKLookup                   272.56   (3.2%)     271.93   (3.3%)     -0.2% ( -6% - 6%)
OrNotHighMed               171.29   (2.1%)     170.92   (2.1%)     -0.2% ( -4% - 4%)
LowTerm                    822.27   (5.0%)     821.03   (4.8%)     -0.2% ( -9% - 10%)
MedPhrase                  137.90   (2.5%)     137.77   (2.2%)     -0.1% ( -4% - 4%)
OrNotHighHigh               42.48   (1.4%)      42.44   (1.3%)     -0.1% ( -2% - 2%)
AndHighHigh                 52.45   (1.3%)      52.43   (1.3%)     -0.0% ( -2% - 2%)
AndHighMed                 210.77   (2.1%)     210.83   (2.4%)      0.0% ( -4% - 4%)
LowPhrase                   50.86   (3.8%)      50.90   (3.7%)      0.1% ( -7% - 7%)
MedSpanNear                 19.83   (5.6%)      19.84   (5.6%)      0.1% ( -10% - 11%)
HighPhrase                  18.50   (3.3%)      18.52   (3.4%)      0.1% ( -6% - 7%)
LowSloppyPhrase             41.13   (2.6%)      41.17   (2.2%)      0.1% ( -4% - 5%)
MedSloppyPhrase             60.91   (3.0%)      60.98   (2.4%)      0.1% ( -5% - 5%)
OrHighNotHigh               40.01   (1.2%)      40.06   (1.3%)      0.1% ( -2% - 2%)
IntNRQ                      16.17   (6.2%)      16.19   (6.3%)      0.1% ( -11% - 13%)
OrNotHighLow               611.94   (3.1%)     612.86   (3.3%)      0.1% ( -6% - 6%)
HighSpanNear                 3.47   (5.8%)       3.48   (6.1%)      0.2% ( -11% - 12%)
HighSloppyPhrase            24.65   (3.8%)      24.71   (3.2%)      0.3% ( -6% - 7%)
LowSpanNear                 90.25   (2.8%)      90.61   (3.1%)      0.4% ( -5% - 6%)
Respell                     72.37   (3.9%)      72.95   (3.7%)      0.8% ( -6% - 8%)
AndHighLow                 821.42   (4.4%)     830.72   (4.0%)      1.1% ( -6% - 9%)
{noformat}
wikimedium10M with Lucene (both the baseline and the patched version) modified
to always use BS2 (to validate that sharing the pq between DisjunctionScorer
and MinShouldMatchSumScorer does not hurt):
{noformat}
Task                 QPS baseline   StdDev  QPS patch   StdDev  Pct diff
Fuzzy1                      72.12   (5.7%)      71.59   (7.2%)     -0.7% ( -12% - 12%)
LowTerm                    860.24   (5.5%)     855.31   (5.1%)     -0.6% ( -10% - 10%)
PKLookup                   267.21   (2.6%)     266.07   (2.7%)     -0.4% ( -5% - 4%)
OrNotHighMed               244.11   (1.6%)     243.11   (2.0%)     -0.4% ( -3% - 3%)
MedPhrase                  226.75   (2.3%)     226.07   (2.3%)     -0.3% ( -4% - 4%)
OrHighNotHigh               47.84   (1.6%)      47.71   (1.8%)     -0.3% ( -3% - 3%)
HighSloppyPhrase            24.65   (2.4%)      24.59   (2.4%)     -0.2% ( -4% - 4%)
OrHighNotMed                53.28   (2.9%)      53.15   (3.1%)     -0.2% ( -6% - 5%)
HighTerm                    54.46   (2.3%)      54.36   (2.3%)     -0.2% ( -4% - 4%)
HighPhrase                  46.80   (2.6%)      46.73   (2.7%)     -0.2% ( -5% - 5%)
LowPhrase                  224.46   (3.1%)     224.16   (3.6%)     -0.1% ( -6% - 6%)
OrHighNotLow                75.21   (3.3%)      75.11   (3.2%)     -0.1% ( -6% - 6%)
OrNotHighHigh               55.94   (1.5%)      55.91   (1.8%)     -0.1% ( -3% - 3%)
OrNotHighLow               932.36   (3.4%)     932.28   (3.2%)     -0.0% ( -6% - 6%)
MedTerm                    306.76   (2.3%)     306.77   (2.3%)      0.0% ( -4% - 4%)
AndHighHigh                 59.85   (0.9%)      59.87   (0.9%)      0.0% ( -1% - 1%)
Prefix3                     29.55   (2.4%)      29.56   (2.5%)      0.0% ( -4% - 5%)
LowSloppyPhrase             35.97   (3.0%)      36.01   (2.7%)      0.1% ( -5% - 5%)
LowSpanNear                219.56   (3.0%)     219.86   (3.2%)      0.1% ( -5% - 6%)
OrHighLow                   17.18   (3.2%)      17.21   (4.2%)      0.2% ( -7% - 7%)
IntNRQ                       8.77   (2.8%)       8.79   (3.0%)      0.2% ( -5% - 6%)
AndHighMed                 184.88   (1.5%)     185.24   (1.5%)      0.2% ( -2% - 3%)
Wildcard                    40.48   (2.5%)      40.56   (2.6%)      0.2% ( -4% - 5%)
MedSloppyPhrase             35.40   (2.4%)      35.47   (2.1%)      0.2% ( -4% - 4%)
HighSpanNear                 7.32   (4.8%)       7.35   (5.1%)      0.5% ( -8% - 10%)
Respell                     58.99   (4.0%)      59.35   (2.7%)      0.6% ( -5% - 7%)
AndHighLow                 921.16   (6.8%)     927.79   (4.1%)      0.7% ( -9% - 12%)
MedSpanNear                 10.33   (3.3%)      10.41   (3.2%)      0.7% ( -5% - 7%)
Fuzzy2                      64.92  (11.8%)      65.56  (10.1%)      1.0% ( -18% - 25%)
OrHighMed                   38.65   (2.8%)      39.25   (4.4%)      1.6% ( -5% - 9%)
OrHighHigh                  20.06   (2.7%)      20.64   (4.7%)      2.9% ( -4% - 10%)
{noformat}
MinShouldMatch tasks on wikimedium1M:
{noformat}
Task                 QPS baseline   StdDev  QPS patch   StdDev  Pct diff
Low3MinShouldMatch4       1204.57   (7.1%)     944.59   (2.7%)    -21.6% ( -29% - -12%)
Low4MinShouldMatch4       1294.39  (10.7%)    1026.20   (3.8%)    -20.7% ( -31% - -6%)
Low4MinShouldMatch3        993.94   (7.7%)     827.22   (5.4%)    -16.8% ( -27% - -3%)
Low4MinShouldMatch2        314.76   (4.9%)     272.76   (3.4%)    -13.3% ( -20% - -5%)
Low2MinShouldMatch4        345.04   (8.0%)     303.89   (6.4%)    -11.9% ( -24% - 2%)
Low3MinShouldMatch3        304.41   (5.0%)     271.79   (3.3%)    -10.7% ( -18% - -2%)
Low1MinShouldMatch4         39.30   (8.8%)      39.70   (2.7%)      1.0% ( -9% - 13%)
Low4MinShouldMatch0         71.65   (3.2%)      73.38   (6.5%)      2.4% ( -7% - 12%)
Low3MinShouldMatch0         47.45   (2.2%)      49.00   (7.2%)      3.3% ( -5% - 12%)
Low2MinShouldMatch3         37.50   (9.3%)      38.83   (8.4%)      3.5% ( -12% - 23%)
HighMinShouldMatch0         25.55   (1.8%)      26.52   (8.1%)      3.8% ( -6% - 13%)
Low2MinShouldMatch0         35.24   (1.8%)      36.61   (7.3%)      3.9% ( -5% - 13%)
PKLookup                   316.26   (2.5%)     328.56   (3.4%)      3.9% ( -1% - 10%)
Low1MinShouldMatch0         29.63   (2.1%)      30.89   (7.6%)      4.2% ( -5% - 14%)
Low3MinShouldMatch2         38.91   (9.6%)      47.22   (8.4%)     21.3% ( 3% - 43%)
HighMinShouldMatch4         21.97   (9.4%)      27.15  (10.2%)     23.5% ( 3% - 47%)
Low1MinShouldMatch3         21.84   (9.5%)      30.69  (10.9%)     40.5% ( 18% - 67%)
Low2MinShouldMatch2         22.69   (9.7%)      35.21  (11.0%)     55.2% ( 31% - 83%)
HighMinShouldMatch3         16.01   (9.7%)      26.07  (12.3%)     62.9% ( 37% - 94%)
Low1MinShouldMatch2         16.86   (9.3%)      29.94  (13.1%)     77.6% ( 50% - 110%)
HighMinShouldMatch2         13.64   (8.6%)      25.99  (14.6%)     90.6% ( 62% - 124%)
{noformat}
On this last benchmark, we can see that using BooleanScorer speeds up the
slowest queries (up to +90.6% on HighMinShouldMatch2). Some of the fastest
queries get slower (up to -21.6% on Low3MinShouldMatch4), but it looks like
that is just because MinShouldMatchSumScorer gets picked less often.
I think it's ready?
> MinShouldMatchSumScorer should advance less and score lazily
> ------------------------------------------------------------
>
> Key: LUCENE-6201
> URL: https://issues.apache.org/jira/browse/LUCENE-6201
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: Trunk, 5.1
>
> Attachments: LUCENE-6201.patch, LUCENE-6201.patch, LUCENE-6201.patch
>
>
> MinShouldMatchSumScorer currently computes the score eagerly, even on
> documents that do not eventually match if it cannot find {{minShouldMatch}}
> matches on the same document.