[
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582453#comment-13582453
]
Michael McCandless commented on LUCENE-4571:
--------------------------------------------
I fixed luceneutil to recognize +minShouldMatch=N, and made a trivial
tasks file:
{noformat}
HighMinShouldMatch4: ref http from name title +minShouldMatch=4
HighMinShouldMatch3: ref http from name title +minShouldMatch=3
HighMinShouldMatch2: ref http from name title +minShouldMatch=2
HighMinShouldMatch0: ref http from name title
Low1MinShouldMatch4: ref http from name dublin +minShouldMatch=4
Low1MinShouldMatch3: ref http from name dublin +minShouldMatch=3
Low1MinShouldMatch2: ref http from name dublin +minShouldMatch=2
Low1MinShouldMatch0: ref http from name dublin
Low2MinShouldMatch4: ref http from wings dublin +minShouldMatch=4
Low2MinShouldMatch3: ref http from wings dublin +minShouldMatch=3
Low2MinShouldMatch2: ref http from wings dublin +minShouldMatch=2
Low2MinShouldMatch0: ref http from wings dublin
Low3MinShouldMatch4: ref http struck wings dublin +minShouldMatch=4
Low3MinShouldMatch3: ref http struck wings dublin +minShouldMatch=3
Low3MinShouldMatch2: ref http struck wings dublin +minShouldMatch=2
Low3MinShouldMatch0: ref http struck wings dublin
Low4MinShouldMatch4: ref restored struck wings dublin +minShouldMatch=4
Low4MinShouldMatch3: ref restored struck wings dublin +minShouldMatch=3
Low4MinShouldMatch2: ref restored struck wings dublin +minShouldMatch=2
Low4MinShouldMatch0: ref restored struck wings dublin
{noformat}
So, each query has 5 terms. High* means all 5 are high freq, Low1*
means one term is low freq and 4 are high, Low2* means 2 terms are low
freq and 3 are high, etc.
I tested on the 10 M doc wikimedium index, and for both base (= trunk)
and comp (= this patch) I forcefully disabled BS1:
{noformat}
Task QPS base StdDev QPS comp StdDev
Pct diff
Low3MinShouldMatch2 3.95 (3.5%) 3.00 (2.1%)
-24.1% ( -28% - -19%)
Low1MinShouldMatch2 1.93 (3.1%) 1.50 (2.1%)
-22.4% ( -26% - -17%)
Low2MinShouldMatch2 2.52 (3.4%) 1.96 (2.0%)
-22.3% ( -26% - -17%)
HighMinShouldMatch2 1.62 (3.2%) 1.27 (2.2%)
-21.3% ( -25% - -16%)
HighMinShouldMatch3 1.65 (3.5%) 1.31 (2.3%)
-20.7% ( -25% - -15%)
Low4MinShouldMatch0 6.91 (3.9%) 5.79 (1.6%)
-16.2% ( -20% - -11%)
Low1MinShouldMatch3 1.98 (3.4%) 1.66 (2.3%)
-15.8% ( -20% - -10%)
Low3MinShouldMatch0 3.69 (3.2%) 3.21 (2.1%)
-13.0% ( -17% - -8%)
Low2MinShouldMatch0 2.38 (3.0%) 2.09 (1.9%)
-12.3% ( -16% - -7%)
Low1MinShouldMatch0 1.84 (2.7%) 1.65 (2.2%)
-10.4% ( -14% - -5%)
HighMinShouldMatch0 1.56 (2.9%) 1.41 (2.5%)
-9.8% ( -14% - -4%)
HighMinShouldMatch4 1.67 (3.6%) 1.55 (2.8%)
-7.1% ( -13% - 0%)
Low2MinShouldMatch3 2.64 (3.8%) 2.65 (2.4%)
0.3% ( -5% - 6%)
Low1MinShouldMatch4 2.02 (3.5%) 2.36 (2.8%)
16.8% ( 10% - 23%)
Low4MinShouldMatch2 8.53 (5.3%) 33.74 (5.8%)
295.8% ( 270% - 324%)
Low4MinShouldMatch3 8.56 (5.4%) 44.93 (8.6%)
424.8% ( 389% - 463%)
Low3MinShouldMatch3 4.25 (4.1%) 23.48 (8.8%)
452.7% ( 422% - 485%)
Low4MinShouldMatch4 8.59 (5.2%) 59.53 (11.1%)
593.3% ( 548% - 643%)
Low2MinShouldMatch4 2.68 (3.9%) 21.38 (14.3%)
696.8% ( 653% - 743%)
Low3MinShouldMatch4 4.25 (4.1%) 34.97 (15.4%)
722.5% ( 675% - 773%)
{noformat}
The new scorer is waaay faster when the minShouldMatch constraint is
highly restrictive, i.e. when .advance is being used on only low-freq
terms (I think?). It a bit slower for the no-minShouldMatch case
(*MinShouldMatch0). When .advance is sometimes used on the high freq
terms it's a bit slower than BS2 today.
I ran a 2nd test, this time with BS1 as the baseline. BS1 is faster
than BS2, but indeed it still evaluates all subs and only rules out
minShouldMmatch in the end. I had to turn off luceneutil's score
comparisons since BS1/BS2 produce different scores:
{noformat}
Task QPS base StdDev QPS comp StdDev
Pct diff
HighMinShouldMatch2 3.33 (8.8%) 1.30 (0.8%)
-60.9% ( -64% - -56%)
HighMinShouldMatch3 3.35 (8.8%) 1.33 (1.0%)
-60.5% ( -64% - -55%)
Low1MinShouldMatch2 3.79 (8.4%) 1.52 (0.9%)
-59.9% ( -63% - -55%)
HighMinShouldMatch0 3.46 (9.1%) 1.43 (0.4%)
-58.5% ( -62% - -53%)
Low1MinShouldMatch0 4.00 (8.9%) 1.68 (0.4%)
-58.0% ( -61% - -53%)
Low2MinShouldMatch2 4.66 (8.0%) 1.99 (1.0%)
-57.3% ( -61% - -52%)
Low2MinShouldMatch0 4.97 (8.6%) 2.12 (0.4%)
-57.3% ( -61% - -52%)
Low1MinShouldMatch3 3.85 (8.6%) 1.68 (1.1%)
-56.3% ( -60% - -50%)
Low4MinShouldMatch0 13.32 (8.8%) 5.87 (0.5%)
-56.0% ( -59% - -51%)
Low3MinShouldMatch2 6.87 (8.4%) 3.03 (1.0%)
-55.9% ( -60% - -50%)
Low3MinShouldMatch0 7.13 (8.7%) 3.27 (0.5%)
-54.1% ( -58% - -49%)
HighMinShouldMatch4 3.40 (9.0%) 1.57 (1.3%)
-53.9% ( -58% - -47%)
Low2MinShouldMatch3 4.92 (8.4%) 2.67 (1.4%)
-45.8% ( -51% - -39%)
Low1MinShouldMatch4 4.04 (8.9%) 2.37 (1.5%)
-41.2% ( -47% - -33%)
Low4MinShouldMatch2 14.17 (9.1%) 33.77 (2.9%)
138.4% ( 115% - 165%)
Low4MinShouldMatch3 14.28 (9.1%) 44.93 (3.1%)
214.7% ( 185% - 249%)
Low3MinShouldMatch3 7.39 (9.1%) 23.47 (3.7%)
217.4% ( 187% - 253%)
Low2MinShouldMatch4 5.13 (8.8%) 21.36 (5.5%)
316.4% ( 277% - 362%)
Low4MinShouldMatch4 14.30 (9.0%) 59.65 (4.4%)
317.3% ( 278% - 363%)
Low3MinShouldMatch4 7.41 (9.1%) 34.92 (5.2%)
371.2% ( 327% - 424%)
{noformat}
So it looks like BS1 is faster when the minShouldMatch isn't very
restrictive ... somehow we need some heuristics for BQ to pick the
best scorer given the estimated cost/number of hits of each subs vs
the minShouldMatch constraint (only in those cases when BS1 is
"allowed"...).
But given that this new scorer drastically speeds up the BS2 case in the highly
restrictive cases, and only slows it down a bit for the other cases, I
think we should commit the new scorer, and then separately iterate on
the heuristics for when to choose which sub scorer?
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
> Key: LUCENE-4571
> URL: https://issues.apache.org/jira/browse/LUCENE-4571
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.1
> Reporter: Mikhail Khludnev
> Attachments: LUCENE-4571.patch
>
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole
> disjunction, and verifies minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
> assert doc != NO_MORE_DOCS;
> while(true) {
> while (subScorers[0].docID() == doc) {
> if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
> heapAdjust(0);
> } else {
> heapRemoveRoot();
> if (numScorers < minimumNrMatchers) {
> return doc = NO_MORE_DOCS;
> }
> }
> }
> afterNext();
> if (nrMatchers >= minimumNrMatchers) {
> break;
> }
> }
>
> return doc;
> }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the
> heap first, and then push them back advancing behind that top doc. For me the
> question no.1 is there a performance test for minShouldMatch constrained
> disjunction.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]