[ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504158#comment-13504158 ]
Stefan Pohl commented on LUCENE-4571: ------------------------------------- Mikhail, Robert, thank you for following up on this and I'm happy to provide more details on ideas of how to use the minimumMatch constraint to speed up the execution of disjunctive queries. Currently, one heap is used in disjunctive scorers to keep track of the next docid (next candidate), and only then these candidates are ruled out if below minimum number of terms / percentage of terms. If there is only one stop word or otherwise high docfreq term in the query, this will produce a huge number of candidates that only occur in very few query terms and only afterwards will be ruled out by the constraint again. As Robert pointed out, this leads to long processing times to attain the next valid candidate. Using BooleanScorer will probably only give improvement by a factor (in some cases), but also this scorers would still generate all these useless candidates. It would be much more efficient to actively use the constraint as part of query processing and employ inverted list skipping (leap-frogging), which makes conjunctive queries so efficient, also for these kind of queries. An efficient implementation could look like the following: Assume at least 5 out of 20 query terms have to match and imagine the terms/inverted lists to be sorted by the next docid. Then, the first potential candidate that satisfies the constraint would be the docid at position 5 and all of the inverted list at position 0-4 can be safely advanced and have to match to that docid in order to generate a real candidate worth scoring. The order, in which to try advancing terms 0-4 should probably be guided by the sparsity of the lists (guided by either docfreq, if available, or a heuristic such as the distance to the next docid), that is, inverted list 4 should first be advanced to the docid, then list 3,2,1, if previous advances are successful. Otherwise, the algorithm can start all over and try the next docid at 5th position. This leads to effective leap-frogging on the most sparse lists within the constraint which rules out matches not satisfying the constraint, most of the time without even touching the very high-frequency terms. Improvements all depend on the type of queries and number of docs per index. There are different implementations possible to achieve such a behavior: 1) Using the current heap and pull out the first 5 lists to get to the next candidate docid to which the other 4 then have to be advanced. This probably leads to more heap removal and addition operations than necessary with the next approaches. 2) Always keep the first 5 lists fully sorted and use the heap to keep track of the next docid within lists 6-20. This invariant should simplify implementation and be competitive for small minimumMatch. 3) Use two heaps: A max-heap to keep track of the largest docid within the first 5 inverted lists and a min-heap to keep track of the smallest docid within lists 6-20. This approach looks most efficient. These implementations look like generalizations to pure disjunctive queries, but if there should be any impact they could be implemented as specializations used only for queries with a minimum match constraint. > speedup disjunction with minShouldMatch > ---------------------------------------- > > Key: LUCENE-4571 > URL: https://issues.apache.org/jira/browse/LUCENE-4571 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search > Affects Versions: 4.1 > Reporter: Mikhail Khludnev > > even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole > disjunction, and verifies minShouldMatch condition [on every > doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]: > {code} > public int nextDoc() throws IOException { > assert doc != NO_MORE_DOCS; > while(true) { > while (subScorers[0].docID() == doc) { > if (subScorers[0].nextDoc() != NO_MORE_DOCS) { > heapAdjust(0); > } else { > heapRemoveRoot(); > if (numScorers < minimumNrMatchers) { > return doc = NO_MORE_DOCS; > } > } > } > afterNext(); > if (nrMatchers >= minimumNrMatchers) { > break; > } > } > > return doc; > } > {code} > [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the > heap first, and then push them back advancing behind that top doc. For me the > question no.1 is there a performance test for minShouldMatch constrained > disjunction. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org