[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086295#comment-14086295
]
Da Huang commented on LUCENE-4396:
----------------------------------
Thanks for Mike and Paul's suggestions.
{quote}
It's a bit spooky that collectMore recurses on itself; in theory
there's an adversary that could consume quite a bit of stack right?
Can we refactor that to the equivalent while loop (it's "just" tail
recursion).
{quote}
Ok. Doing collectMore without recursion is easy.
{quote}
Unfortunately the logic for picking which scorer to use looks really
complex; hopefully we can simplify it.
Also, do we really need 3 scorer classes (BS, BAS, BLS) for the
non-DAAT case? Ie, does each really provide a compelling situation
where it's better than the others?
{quote}
Actually, these scorers are still very competitive when clauses are much fewer.
I have done some tests today. Results are as follows.
{code}
Task array bs ll
HighAnd10HighNot 34.7 31.7 53.7*
HighAnd10HighOr 27.0+ -0.9 32.6*
HighAnd10LowNot -33.5 -6.5 -17.5
HighAnd10LowOr -36.7 -2.1 -43.4
HighAnd5HighNot 3.0 -10.0 16.8*
HighAnd5HighOr -11.5 -2.0 -4.2
HighAnd5LowNot -44.5 -9.4 -36.6
HighAnd5LowOr -56.2 -3.4 -61.9
LowAnd10HighNot 18.2+ 18.7+ 20.0*
LowAnd10HighOr 21.2+ 26.1* 6.0
LowAnd10LowNot 19.6* -2.4 5.7
LowAnd10LowOr 13.3* -2.7 -11.1
LowAnd5HighNot 9.1* 6.9+ -2.7
LowAnd5HighOr 7.6 12.5* -9.3
LowAnd5LowNot -0.9 -4.0 -11.0
LowAnd5LowOr -7.5 -3.6 -27.2
Task Good Method
HighAnd10HighNot ll,
HighAnd10HighOr ll, array,
HighAnd10LowNot
HighAnd10LowOr
HighAnd5HighNot ll,
HighAnd5HighOr
HighAnd5LowNot
HighAnd5LowOr
LowAnd10HighNot ll, bs, array,
LowAnd10HighOr bs, array,
LowAnd10LowNot array,
LowAnd10LowOr array,
LowAnd5HighNot array, bs,
LowAnd5HighOr bs,
LowAnd5LowNot
LowAnd5LowOr
================================================
Task array bs ll
HighAnd25HighNot 75.3 79.1 131.5*
HighAnd25HighOr 69.8+ 74.2+ 80.8*
HighAnd25LowNot -1.0 3.8 15.7*
HighAnd25LowOr -3.8 -34.7 -9.1
HighAnd5HighNot 8.1* -2.9 7.8+
HighAnd5HighOr -1.6 -4.1 -12.9
HighAnd5LowNot -37.3 -33.3 -39.1
HighAnd5LowOr -60.8 -42.5 -60.7
LowAnd25HighNot 38.9 40.1 79.4*
LowAnd25HighOr 44.8* 40.2+ 23.5
LowAnd25LowNot 52.7+ 55.7* 39.2+
LowAnd25LowOr 51.1* 50.8+ 23.7
LowAnd5HighNot 10.0+ 12.0* -2.7
LowAnd5HighOr 5.0 8.0* -9.9
LowAnd5LowNot 2.6 4.1* -10.1
LowAnd5LowOr -8.8 -5.1 -29.1
Task Good Method
HighAnd25HighNot ll,
HighAnd25HighOr ll, bs, array,
HighAnd25LowNot ll,
HighAnd25LowOr
HighAnd5HighNot array, ll,
HighAnd5HighOr
HighAnd5LowNot
HighAnd5LowOr
LowAnd25HighNot ll,
LowAnd25HighOr array, bs,
LowAnd25LowNot bs, array, ll,
LowAnd25LowOr array, bs,
LowAnd5HighNot bs, array,
LowAnd5HighOr bs,
LowAnd5LowNot bs,
LowAnd5LowOr
{code}
Now, I'm just using BAS and BLS for cases with MUST, as BS's perfermance is not
very competitive.
Even though BS seems to be a compelling choice for the case LowAnd5HighOr, its
superiority to BAS is not huge.
Besides, BS can make the logics even more complicate, as BS is BulkScorer while
others are Scorer.
If we still need to give up one scorer, I think it would be better to give up
BLS,
as it looks that BAS to have more positive value than BLS.
{quote}
It's not great adding so much
complexity for performance gains of unusual (so many clauses) boolean
queries...
{quote}
I'm going to just focus the \*And5\* & \*And10\* cases to optimize the perf.
If 10 clauses are still too many, I will just focus on the \*And5\* cases.
Besides, today when I did tests to check the correctness, I discovered that
BS, BAS, as well as BLS, are not supposed to deal with pure conjunction.
TestBooleanQueryVisitSubscorers.testConjunction will fail, when these scorers
have only MUST clauses.
This is because these three scorers do not have .getChildren() implemented.
I think we should throw exception when people try to take them as
ConjunctionScorer. Do you agree?
> BooleanScorer should sometimes be used for MUST clauses
> -------------------------------------------------------
>
> Key: LUCENE-4396
> URL: https://issues.apache.org/jira/browse/LUCENE-4396
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: And.tasks, And.tasks, AndOr.tasks, AndOr.tasks,
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch,
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch,
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch,
> LUCENE-4396.patch, SIZE.perf, all.perf, luceneutil-score-equal.patch,
> luceneutil-score-equal.patch, stat.cpp, stat.cpp, tasks.cpp
>
>
> Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
> If there is one or more MUST clauses we always use BooleanScorer2.
> But I suspect that unless the MUST clauses have very low hit count compared
> to the other clauses, that BooleanScorer would perform better than
> BooleanScorer2. BooleanScorer still has some vestiges from when it used to
> handle MUST so it shouldn't be hard to bring back this capability ... I think
> the challenging part might be the heuristics on when to use which (likely we
> would have to use firstDocID as proxy for total hit count).
> Likely we should also have BooleanScorer sometimes use .advance() on the subs
> in this case, eg if suddenly the MUST clause skips 1000000 docs then you want
> to .advance() all the SHOULD clauses.
> I won't have near term time to work on this so feel free to take it if you
> are inspired!
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]