[
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Pohl updated LUCENE-4571:
--------------------------------
Attachment: LUCENE-4571.patch
Robert, thank you for the excellent feedback.
I haven't looked at BS in detail for a while, and everything you say sounds very
reasonable. My statement "improvement by a factor" was made under the assumption
that BS would, like BS2, generate all OR-candidates and only afterwards screen
many of them out again due to the minimum-match constraint. If that is the case,
and we assume BS to be faster than BS2 by some factor, then it will be faster by
that very same factor on a larger collection, whereas an optimized BS2 might
become faster than BS because it does not generate the many useless candidates
that arise for queries with a single super-common term (whose frequency is
proportional to collection size) and a minimum-match constraint of at least 2. I
hope this makes sense now, but of course my assumption might be wrong :)
In fact, I thought quite a bit about different approaches to implementing a
minimum-match-optimized version of BS2 and converged on implementation approach
1), since the others add expensive operations and overhead once all the details
are worked out. The attached patch contains a full drop-in replacement for
DisjunctionSumScorer in DisjunctionSumScorerMM.java and changes the references
within BooleanScorer2 accordingly. All existing tests pass.
As this scorer is supposed to work with arbitrary subclauses (not only
TermQuery), I decided on an implementation that dynamically orders the first
mm-1 subscorers by their next docid, thereby exploiting local
within-inverted-list docid distributions. Fixing the mm-1 subscorers on the
basis of their document frequency/sparseness estimates could be better (fewer
heap operations, but no exploitation of within-list docid distributions), but
such estimates are currently only available for TermQuery clauses and would
hence limit the applicability of the implementation. An API to determine the
sparseness of a subscorer has already come up in some other tickets, and many
other Scorer implementations could similarly benefit from its availability.
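To make the idea concrete, here is a toy sketch of the static variant mentioned above (fixing the check lists by sparseness). This is not the patch's code; the class name, the array-based posting lists, and the candidate/check split are all illustrative. It relies on the observation that a doc matching mm of n clauses must occur in at least one of any n-mm+1 of them, so candidates are generated only from the n-mm+1 sparsest lists, and the mm-1 densest lists are merely advanced (with skipping, like Scorer.advance) to confirm them:

```java
import java.util.*;

/** Toy sketch (not Lucene's API): minShouldMatch disjunction over sorted
 *  doc-id lists. Candidates come only from the n-mm+1 sparsest "lead" lists;
 *  the mm-1 densest "check" lists are advanced with skipping to confirm
 *  candidates, instead of enumerating the full OR of all clauses. */
class MinShouldMatchUnion {
  static List<Integer> match(int[][] lists, int mm) {
    int n = lists.length;
    // Order lists by length so the mm-1 longest (densest) become check lists;
    // a doc matching mm lists must appear in at least one of the other n-mm+1.
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingInt(i -> lists[i].length));
    int lead = n - mm + 1;

    // Generate candidate docs from the lead (sparsest) lists only.
    TreeSet<Integer> candidates = new TreeSet<>();
    for (int i = 0; i < lead; i++)
      for (int d : lists[order[i]]) candidates.add(d);

    List<Integer> result = new ArrayList<>();
    int[] pos = new int[n]; // per-list cursors; candidates arrive in doc order
    for (int doc : candidates) {
      int count = 0;
      for (int li = 0; li < n; li++) {
        // advance cursor to first entry >= doc (skipping, like advance())
        while (pos[li] < lists[li].length && lists[li][pos[li]] < doc) pos[li]++;
        if (pos[li] < lists[li].length && lists[li][pos[li]] == doc) count++;
      }
      if (count >= mm) result.add(doc);
    }
    return result;
  }
}
```

The dynamic variant in the patch differs in that the lead/check split is re-derived from the subscorers' current docids rather than fixed up front, which is what allows it to work without sparseness estimates.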
However, I share your concern about not intermingling too many aspects within
one Scorer, which makes it overly complex and probably less amenable to VM
optimizations (e.g. loops that are not as tight). This is why I implemented it
in a separate class, so you could go ahead, remove the mm-constraint from
DisjunctionSumScorer, and use either implementation depending on the given query.
Also, this implementation should still be benchmarked for response time on
representative queries of interest and different mm-constraint settings
(luceneutil?). I wouldn't be surprised if my implementation is not as quick as
DisjunctionSumScorer when mm=1, but I have anecdotal evidence that it does a
great job (speedups of 2-3x) for higher mm (and longer queries).
I don't quite see why the attached implementation would perform more heap
operations, but I agree that it could be slower due to longer, more complex
loops, a few more if statements, etc.
Hope this contribution helps.
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
> Key: LUCENE-4571
> URL: https://issues.apache.org/jira/browse/LUCENE-4571
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.1
> Reporter: Mikhail Khludnev
> Attachments: LUCENE-4571.patch
>
>
> Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the whole
> disjunction and verifies the minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
>   assert doc != NO_MORE_DOCS;
>   while (true) {
>     while (subScorers[0].docID() == doc) {
>       if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>         heapAdjust(0);
>       } else {
>         heapRemoveRoot();
>         if (numScorers < minimumNrMatchers) {
>           return doc = NO_MORE_DOCS;
>         }
>       }
>     }
>     afterNext();
>     if (nrMatchers >= minimumNrMatchers) {
>       break;
>     }
>   }
>   return doc;
> }
> {code}
> [~spo] proposes (as far as I understand it) to first pop nrMatchers-1 scorers from the
> heap, and then push them back after advancing them beyond that top doc. For me,
> question no. 1 is whether there is a performance test for minShouldMatch-constrained
> disjunctions.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]