[
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Pohl updated LUCENE-4571:
--------------------------------
Attachment: LUCENE-4571.patch
Robert, thank you for the excellent feedback.
I haven't looked at BS in detail for a while, and everything you say sounds very
reasonable. My statement "improvement by a factor" was made under the assumption
that BS would, like BS2, generate all OR-candidates and only afterwards screen
many of them out again due to the minimum-match constraint. If that is the case,
and we assume BS to be faster than BS2 by some factor, then it will be faster by
that very same factor on a larger collection, whereas an optimized BS2 might
become faster than BS because it does not generate the many useless candidates
that arise for queries with a single super-common term (whose frequency is
proportional to collection size) and a minimum-match constraint of at least 2. I
hope this makes sense now, but of course my assumption might be wrong :)
In fact, I thought quite a bit about different approaches to implementing a
minimum-match-optimized version of BS2 and converged on implementation approach
1), since the others add expensive operations and overhead once all the details
are worked out. The attached patch contains a full drop-in replacement for
DisjunctionSumScorer in DisjunctionSumScorerMM.java and changes the references
within BooleanScorer2 accordingly. All existing tests pass.
As this scorer is supposed to work with arbitrary subclauses (not only
TermQuery), I decided on an implementation that dynamically orders the first
mm-1 subscorers by their next docid, thereby exploiting local
within-inverted-list docid distributions. Fixing the mm-1 subscorers on the
basis of their document frequency/sparseness estimates could be better (fewer
heap operations, but no exploitation of within-list docid distributions), but
such estimates are currently only available for TermQuery clauses and would
hence limit the applicability of the implementation. An API to determine the
sparseness of a subscorer has already come up in some other tickets, and many
other Scorer implementations could similarly benefit from its availability.
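To make the idea concrete, here is a toy sketch of the static variant mentioned above (fixing the check lists by sparseness). This is not the patch's code; the class name, the array-based posting lists, and the candidate/check split are all illustrative. It relies on the observation that a doc matching mm of n clauses must occur in at least one of any n-mm+1 of them, so candidates are generated only from the n-mm+1 sparsest lists, and the mm-1 densest lists are merely advanced (with skipping, like Scorer.advance) to confirm them:

```java
import java.util.*;

/** Toy sketch (not Lucene's API): minShouldMatch disjunction over sorted
 *  doc-id lists. Candidates come only from the n-mm+1 sparsest "lead" lists;
 *  the mm-1 densest "check" lists are advanced with skipping to confirm
 *  candidates, instead of enumerating the full OR of all clauses. */
class MinShouldMatchUnion {
  static List<Integer> match(int[][] lists, int mm) {
    int n = lists.length;
    // Order lists by length so the mm-1 longest (densest) become check lists;
    // a doc matching mm lists must appear in at least one of the other n-mm+1.
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingInt(i -> lists[i].length));
    int lead = n - mm + 1;

    // Generate candidate docs from the lead (sparsest) lists only.
    TreeSet<Integer> candidates = new TreeSet<>();
    for (int i = 0; i < lead; i++)
      for (int d : lists[order[i]]) candidates.add(d);

    List<Integer> result = new ArrayList<>();
    int[] pos = new int[n]; // per-list cursors; candidates arrive in doc order
    for (int doc : candidates) {
      int count = 0;
      for (int li = 0; li < n; li++) {
        // advance cursor to first entry >= doc (skipping, like advance())
        while (pos[li] < lists[li].length && lists[li][pos[li]] < doc) pos[li]++;
        if (pos[li] < lists[li].length && lists[li][pos[li]] == doc) count++;
      }
      if (count >= mm) result.add(doc);
    }
    return result;
  }
}
```

The dynamic variant in the patch differs in that the lead/check split is re-derived from the subscorers' current docids rather than fixed up front, which is what allows it to work without sparseness estimates.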
However, I share your concern about not intermingling too many aspects within
one Scorer, which makes it overly complex and probably less amenable to VM
optimizations (e.g. loops that are not as tight). This is why I implemented it
in a separate class, so you could go ahead, remove the mm-constraint from
DisjunctionSumScorer, and use either implementation depending on the given query.
Also, this implementation should still be benchmarked for response time on
representative queries of interest and different mm-constraint settings
(luceneutil?). I wouldn't be surprised if my implementation is not as quick as
DisjunctionSumScorer when mm=1, but I have anecdotal evidence that it does a
great job (speedups of 2-3x) for higher mm (and longer queries).
I don't quite see why the attached implementation would perform more heap
operations, but I agree that it could be slower due to longer, more complex
loops, a few more if statements, etc.
Hope this contribution helps.
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
> Key: LUCENE-4571
> URL: https://issues.apache.org/jira/browse/LUCENE-4571
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.1
> Reporter: Mikhail Khludnev
> Attachments: LUCENE-4571.patch
>
>
> Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the whole
> disjunction and verifies the minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
>   assert doc != NO_MORE_DOCS;
>   while (true) {
>     while (subScorers[0].docID() == doc) {
>       if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>         heapAdjust(0);
>       } else {
>         heapRemoveRoot();
>         if (numScorers < minimumNrMatchers) {
>           return doc = NO_MORE_DOCS;
>         }
>       }
>     }
>     afterNext();
>     if (nrMatchers >= minimumNrMatchers) {
>       break;
>     }
>   }
>   return doc;
> }
> {code}
> [~spo] proposes (as far as I understand it) to first pop nrMatchers-1 scorers from the
> heap, and then push them back after advancing them beyond that top doc. For me,
> question no. 1 is whether there is a performance test for minShouldMatch-constrained
> disjunctions.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]