Re: minShouldMatch can't leapfrog

2012-11-23 Thread Mikhail Khludnev
Hello Developers,

I appreciate your feedback. I raised the ticket for the last one which is
more common and reachable gain.
https://issues.apache.org/jira/browse/LUCENE-4571
Stefan kindly agreed to add more details into it.

Happy Black Friday!


On Tue, Nov 13, 2012 at 11:59 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Developers,

 I want to discuss few points regarding disjunction form of BooleanQuery
 with minShouldMatch constraint. I'm talking about doc-at-time evaluation
 only (BooleanScorer2).
 Look into conjunction query which has disjunction in one of its' clauses
 e.g. +foo +(bar baz …). if disjunction (bar baz … ) has high
 minShouldMatch constraint even if conjunction clause +foo is highly
 selective this query performs quite bad. It also happens if instead of +foo
 you have a filter. Once again: it's reasonable that disjunction even with
 high minShouldMatch is expensive, if core disjunction with
 minShouldMatch=1 matches few millions of docs. The problem is that I can't
 speed it up by supplying highly selective filter.
 From my POV there two points in Lucene API which make leapfrog impossible:
 - advance() is obliged to return next matching doc that causes scan in
 nextDoc() loop. It's great to have something like advanceExact(), or return
 some magic value from advance says - failed to advance, propose next for
 leapfrog;
 - Scorer is obliged to jump on the first matching doc after it's created
 that leads to scan many docs in nextDoc() loop;
 - ConjunctiveScorer can't know which of its legs are not able to leapfrog
 and prefer to decline advance()
 - Stefan spotted one more gain for minShouldMatch in DisjunctionSumScorer:
 We don't need to walk through top of heap after top scorer is advanced.
 Instead of this, let's pop minShouldMatch scores from the heap, look at the
 top after this - top doc can be evaluated as potential match which exceeds
 minShouldMatch constraint. After that, let's push those guys back to the
 heap advancing them to that candidate docnum. It's not cohesioned with
 leapfrog problem, though.

 This last one looks impressing, but I'm not smart enough to realize that
 it gives performance gain. Do you think it's valuable optimization for
 Lucene users? How minShouldMatch is popular,btw? To be honest I'm not
 really suffer form minShouldMatch itself, I have query with my own match
 verification logic and therefore lack of leapfrog bothers me much.

 Looking forward for you feedback!
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


minShouldMatch can't leapfrog

2012-11-13 Thread Mikhail Khludnev
Developers,

I want to discuss few points regarding disjunction form of BooleanQuery
with minShouldMatch constraint. I'm talking about doc-at-time evaluation
only (BooleanScorer2).
Look into conjunction query which has disjunction in one of its' clauses
e.g. +foo +(bar baz …). if disjunction (bar baz … ) has high
minShouldMatch constraint even if conjunction clause +foo is highly
selective this query performs quite bad. It also happens if instead of +foo
you have a filter. Once again: it's reasonable that disjunction even with
high minShouldMatch is expensive, if core disjunction with
minShouldMatch=1 matches few millions of docs. The problem is that I can't
speed it up by supplying highly selective filter.
From my POV there two points in Lucene API which make leapfrog impossible:
- advance() is obliged to return next matching doc that causes scan in
nextDoc() loop. It's great to have something like advanceExact(), or return
some magic value from advance says - failed to advance, propose next for
leapfrog;
- Scorer is obliged to jump on the first matching doc after it's created
that leads to scan many docs in nextDoc() loop;
- ConjunctiveScorer can't know which of its legs are not able to leapfrog
and prefer to decline advance()
- Stefan spotted one more gain for minShouldMatch in DisjunctionSumScorer:
We don't need to walk through top of heap after top scorer is advanced.
Instead of this, let's pop minShouldMatch scores from the heap, look at the
top after this - top doc can be evaluated as potential match which exceeds
minShouldMatch constraint. After that, let's push those guys back to the
heap advancing them to that candidate docnum. It's not cohesioned with
leapfrog problem, though.

This last one looks impressing, but I'm not smart enough to realize that it
gives performance gain. Do you think it's valuable optimization for Lucene
users? How minShouldMatch is popular,btw? To be honest I'm not really
suffer form minShouldMatch itself, I have query with my own match
verification logic and therefore lack of leapfrog bothers me much.

Looking forward for you feedback!
-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com