[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

Da Huang (JIRA) Tue, 05 Aug 2014 07:12:50 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086295#comment-14086295
 ]


Da Huang commented on LUCENE-4396:
----------------------------------

Thanks to Mike and Paul for the suggestions.

{quote}
It's a bit spooky that collectMore recurses on itself; in theory
there's an adversary that could consume quite a bit of stack right?
Can we refactor that to the equivalent while loop (it's "just" tail
recursion).
{quote}
OK. Rewriting collectMore without recursion is easy.
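
A minimal sketch of that refactoring, for illustration only. It is not the code from the patch: the window size, the method signatures, and the idea that collectMore scans fixed-size doc ID windows until one contains a hit are all assumptions made to keep the example self-contained. It shows only the general shape of turning a tail call into a while loop.

```java
import java.util.List;

/** Hypothetical stand-in for the patch's collectMore; everything here is illustrative. */
final class CollectMoreSketch {
    static final int WINDOW = 4; // assumed window size, not taken from the patch

    /**
     * Recursive form: collect the docs that fall in the current window; if the
     * window is empty, tail-call into the next one. A long run of empty windows
     * (for example a gap of 1000000 docs) costs one stack frame per window.
     */
    static int collectMoreRecursive(int[] sortedDocs, int from, int windowBase, List<Integer> hits) {
        if (from >= sortedDocs.length) {
            return from; // no docs left
        }
        int windowEnd = windowBase + WINDOW;
        int i = from;
        while (i < sortedDocs.length && sortedDocs[i] < windowEnd) {
            hits.add(sortedDocs[i++]);
        }
        if (i > from) {
            return i; // this window produced hits
        }
        return collectMoreRecursive(sortedDocs, i, windowEnd, hits); // tail call
    }

    /** Equivalent loop: the tail call becomes "advance the window and go around again". */
    static int collectMoreLoop(int[] sortedDocs, int from, int windowBase, List<Integer> hits) {
        int windowEnd = windowBase + WINDOW;
        while (from < sortedDocs.length) {
            int i = from;
            while (i < sortedDocs.length && sortedDocs[i] < windowEnd) {
                hits.add(sortedDocs[i++]);
            }
            if (i > from) {
                return i; // this window produced hits
            }
            windowEnd += WINDOW; // empty window: move on without growing the stack
        }
        return from; // no docs left
    }
}
```

Both methods return the index of the first doc not yet collected. The loop version uses constant stack space no matter how many empty windows it has to skip.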

{quote}
Unfortunately the logic for picking which scorer to use looks really
complex; hopefully we can simplify it.
Also, do we really need 3 scorer classes (BS, BAS, BLS) for the
non-DAAT case? Ie, does each really provide a compelling situation
where it's better than the others? 
{quote}
Actually, these scorers are still very competitive when there are far fewer clauses.
I ran some tests today. The results are as follows.

{code}
                Task        array           bs           ll
    HighAnd10HighNot        34.7         31.7         53.7*
     HighAnd10HighOr        27.0+        -0.9         32.6*
     HighAnd10LowNot       -33.5         -6.5        -17.5 
      HighAnd10LowOr       -36.7         -2.1        -43.4 
     HighAnd5HighNot         3.0        -10.0         16.8*
      HighAnd5HighOr       -11.5         -2.0         -4.2 
      HighAnd5LowNot       -44.5         -9.4        -36.6 
       HighAnd5LowOr       -56.2         -3.4        -61.9 
     LowAnd10HighNot        18.2+        18.7+        20.0*
      LowAnd10HighOr        21.2+        26.1*         6.0 
      LowAnd10LowNot        19.6*        -2.4          5.7 
       LowAnd10LowOr        13.3*        -2.7        -11.1 
      LowAnd5HighNot         9.1*         6.9+        -2.7 
       LowAnd5HighOr         7.6         12.5*        -9.3 
       LowAnd5LowNot        -0.9         -4.0        -11.0 
        LowAnd5LowOr        -7.5         -3.6        -27.2 

                Task         Good Method
    HighAnd10HighNot       ll, 
     HighAnd10HighOr       ll, array, 
     HighAnd10LowNot       
      HighAnd10LowOr       
     HighAnd5HighNot       ll, 
      HighAnd5HighOr       
      HighAnd5LowNot       
       HighAnd5LowOr       
     LowAnd10HighNot       ll, bs, array, 
      LowAnd10HighOr       bs, array, 
      LowAnd10LowNot       array, 
       LowAnd10LowOr       array, 
      LowAnd5HighNot       array, bs, 
       LowAnd5HighOr       bs, 
       LowAnd5LowNot       
        LowAnd5LowOr       
================================================
                Task        array           bs           ll
    HighAnd25HighNot        75.3         79.1        131.5*
     HighAnd25HighOr        69.8+        74.2+        80.8*
     HighAnd25LowNot        -1.0          3.8         15.7*
      HighAnd25LowOr        -3.8        -34.7         -9.1 
     HighAnd5HighNot         8.1*        -2.9          7.8+
      HighAnd5HighOr        -1.6         -4.1        -12.9 
      HighAnd5LowNot       -37.3        -33.3        -39.1 
       HighAnd5LowOr       -60.8        -42.5        -60.7 
     LowAnd25HighNot        38.9         40.1         79.4*
      LowAnd25HighOr        44.8*        40.2+        23.5 
      LowAnd25LowNot        52.7+        55.7*        39.2+
       LowAnd25LowOr        51.1*        50.8+        23.7 
      LowAnd5HighNot        10.0+        12.0*        -2.7 
       LowAnd5HighOr         5.0          8.0*        -9.9 
       LowAnd5LowNot         2.6          4.1*       -10.1 
        LowAnd5LowOr        -8.8         -5.1        -29.1 

                Task         Good Method
    HighAnd25HighNot       ll, 
     HighAnd25HighOr       ll, bs, array, 
     HighAnd25LowNot       ll, 
      HighAnd25LowOr       
     HighAnd5HighNot       array, ll, 
      HighAnd5HighOr       
      HighAnd5LowNot       
       HighAnd5LowOr       
     LowAnd25HighNot       ll, 
      LowAnd25HighOr       array, bs, 
      LowAnd25LowNot       bs, array, ll, 
       LowAnd25LowOr       array, bs, 
      LowAnd5HighNot       bs, array, 
       LowAnd5HighOr       bs, 
       LowAnd5LowNot       bs, 
        LowAnd5LowOr       
{code}


For now, I'm using only BAS and BLS for cases with MUST clauses, as BS's 
performance is not very competitive.
Even though BS seems to be a compelling choice for the LowAnd5HighOr case, its 
advantage over BAS is not large.
Besides, BS would make the logic even more complicated, as BS is a BulkScorer 
while the others are Scorers.

If we still need to give up one scorer, I think it would be better to give up 
BLS, as BAS appears to have more positive value than BLS.


{quote}
It's not great adding so much
complexity for performance gains of unusual (so many clauses) boolean
queries...
{quote}
I'm going to focus only on the \*And5\* & \*And10\* cases when optimizing performance.
If 10 clauses are still too many, I will focus only on the \*And5\* cases.


Besides, while running correctness tests today, I discovered that BS, BAS, and 
BLS are not meant to handle pure conjunctions.

TestBooleanQueryVisitSubscorers.testConjunction fails when these scorers 
have only MUST clauses.
This is because these three scorers do not implement .getChildren().

I think we should throw an exception when people try to use them as a 
ConjunctionScorer. Do you agree?
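
One possible shape for such a guard, as a self-contained sketch rather than the patch's actual code. The class name, the method name, the local Occur enum (a stand-in that mirrors the constant names of Lucene's BooleanClause.Occur), and the choice of UnsupportedOperationException are all assumptions for illustration.

```java
import java.util.List;

/** Hypothetical guard; not part of Lucene or of the attached patches. */
final class PureConjunctionGuardSketch {
    /** Local stand-in mirroring the MUST / SHOULD / MUST_NOT constant names. */
    enum Occur { MUST, SHOULD, MUST_NOT }

    /**
     * Rejects clause lists that contain no SHOULD or MUST_NOT clause, i.e. the
     * pure-conjunction case these scorers are not meant to handle.
     */
    static void checkNotPureConjunction(List<Occur> clauses) {
        for (Occur occur : clauses) {
            if (occur != Occur.MUST) {
                return; // at least one non-MUST clause: acceptable
            }
        }
        // Exception type is an assumption; the comment above only says "throw an exception".
        throw new UnsupportedOperationException(
            "pure conjunctions should be scored by ConjunctionScorer instead");
    }
}
```

Failing fast at construction time would turn the testConjunction failure into an explicit error instead of a missing getChildren() result.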


> BooleanScorer should sometimes be used for MUST clauses
> -------------------------------------------------------
>
>                 Key: LUCENE-4396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4396
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: And.tasks, And.tasks, AndOr.tasks, AndOr.tasks, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
> LUCENE-4396.patch, SIZE.perf, all.perf, luceneutil-score-equal.patch, 
> luceneutil-score-equal.patch, stat.cpp, stat.cpp, tasks.cpp
>
>
> Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
> If there is one or more MUST clauses we always use BooleanScorer2.
> But I suspect that unless the MUST clauses have very low hit count compared 
> to the other clauses, that BooleanScorer would perform better than 
> BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
> handle MUST so it shouldn't be hard to bring back this capability ... I think 
> the challenging part might be the heuristics on when to use which (likely we 
> would have to use firstDocID as proxy for total hit count).
> Likely we should also have BooleanScorer sometimes use .advance() on the subs 
> in this case, eg if suddenly the MUST clause skips 1000000 docs then you want 
> to .advance() all the SHOULD clauses.
> I won't have near term time to work on this so feel free to take it if you 
> are inspired!



--
This message was sent by Atlassian JIRA
(v6.2#6252)

