from:"Da Huang"

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-18 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: tasks.cpp
And.tasks

A new tasks file, and the program which can generate it.

In order to generate the tasks file with the program, you can run:
{code}
g++ tasks.cpp -std=c++0x -o tasks
./tasks   wikimedium.10M.nostopwords.tasks  And.tasks
{code}

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: And.tasks, And.tasks, And.tasks, AndOr.tasks, 
 AndOr.tasks, LUCENE-4396-simple.patch, LUCENE-4396-simple.patch, 
 LUCENE-4396-simple.patch, LUCENE-4396-simple.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, SIZE.perf, all.perf, 
 luceneutil-score-equal.patch, luceneutil-score-equal.patch, 
 merge-simple.perf, merge-simple.png, merge.perf, merge.png, perf.png, 
 stat.cpp, stat.cpp, tasks.cpp, tasks.cpp


 Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
 If there is one or more MUST clauses we always use BooleanScorer2.
 But I suspect that unless the MUST clauses have very low hit count compared 
 to the other clauses, that BooleanScorer would perform better than 
 BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
 handle MUST so it shouldn't be hard to bring back this capability ... I think 
 the challenging part might be the heuristics on when to use which (likely we 
 would have to use firstDocID as proxy for total hit count).
 Likely we should also have BooleanScorer sometimes use .advance() on the subs 
 in this case, eg if suddenly the MUST clause skips 100 docs then you want 
 to .advance() all the SHOULD clauses.
 I won't have near term time to work on this so feel free to take it if you 
 are inspired!
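
As a rough illustration of the .advance() idea in the description above, here is a minimal sketch that uses plain sorted int arrays as stand-ins for postings. ToyIterator and score are hypothetical names invented for this example, not Lucene classes, and the sketch always advances the SHOULD clauses to the MUST clause's current doc rather than deciding heuristically when to do so.

```java
import java.util.ArrayList;
import java.util.List;

public class AdvanceSketch {
    /** Toy postings iterator over a sorted array of doc IDs (hypothetical, not a Lucene class). */
    static final class ToyIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs;
        private int pos = -1;

        ToyIterator(int... docs) {
            this.docs = docs;
        }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            pos++;
            return docID();
        }

        /** Moves to the first doc >= target; a real implementation would use skip data. */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docID();
        }
    }

    /**
     * Walks the MUST clause and, for each of its docs, advances every SHOULD
     * clause that is behind. Returns "doc:matchingShouldCount" for each hit.
     */
    static List<String> score(ToyIterator must, ToyIterator... shoulds) {
        List<String> hits = new ArrayList<>();
        for (int doc = must.nextDoc(); doc != ToyIterator.NO_MORE_DOCS; doc = must.nextDoc()) {
            int matching = 0;
            for (ToyIterator s : shoulds) {
                if (s.docID() < doc) s.advance(doc); // skip docs the MUST clause jumped over
                if (s.docID() == doc) matching++;
            }
            hits.add(doc + ":" + matching);
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIterator must = new ToyIterator(3, 120, 121);
        ToyIterator a = new ToyIterator(1, 2, 3, 50, 120);
        ToyIterator b = new ToyIterator(4, 5, 121, 300);
        System.out.println(score(must, a, b)); // prints [3:1, 120:1, 121:1]
    }
}
```

When the MUST clause jumps from doc 3 to doc 120, each SHOULD iterator is moved forward in a single advance() call instead of being stepped through the docs in between.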



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-17 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396-simple.patch

Oh, that's very unfortunate. It seems the only choice is to restore BS.

In this patch, I've restored BS. I hope this gives better perf.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-17 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100255#comment-14100255
 ] 

Da Huang commented on LUCENE-4396:
--

I've tested again with exactly the same setup as Mike's.
Here are the results.
{code}
                  Task   QPS baseline   StdDev QPS my_version   StdDev  Pct diff
          HighSpanNear           1.05   (2.1%)           1.04   (2.1%)     -1.6% ( -5% -   2%)
      HighSloppyPhrase           3.83   (5.3%)           3.78   (4.9%)     -1.3% (-10% -   9%)
               LowTerm          78.04   (4.5%)          77.13   (4.5%)     -1.2% ( -9% -   8%)
           MedSpanNear           2.89   (3.9%)           2.86   (3.3%)     -1.1% ( -8% -   6%)
           LowSpanNear           5.91   (4.9%)           5.84   (4.2%)     -1.1% ( -9% -   8%)
              HighTerm           8.02  (12.1%)           7.94  (11.4%)     -1.0% (-21% -  25%)
           AndHighHigh           9.84   (1.9%)           9.74   (2.4%)     -1.0% ( -5% -   3%)
               MedTerm          30.63   (4.7%)          30.35   (5.1%)     -0.9% (-10% -   9%)
       LowSloppyPhrase           5.83   (4.4%)           5.79   (4.5%)     -0.7% ( -9% -   8%)
       MedSloppyPhrase          16.86   (4.5%)          16.75   (4.3%)     -0.6% ( -9% -   8%)
             OrHighMed           7.57   (4.5%)           7.55   (4.1%)     -0.3% ( -8% -   8%)
          OrNotHighLow           7.87   (5.3%)           7.84   (5.3%)     -0.3% (-10% -  10%)
            AndHighMed          25.10   (3.1%)          25.05   (3.7%)     -0.2% ( -6% -   6%)
                Fuzzy2          10.80   (2.7%)          10.78   (2.9%)     -0.1% ( -5% -   5%)
            OrHighHigh           8.75   (4.4%)           8.74   (4.1%)     -0.1% ( -8% -   8%)
          OrHighNotMed           7.33   (4.4%)           7.33   (4.0%)     -0.1% ( -8% -   8%)
         OrNotHighHigh           4.84   (5.1%)           4.84   (5.0%)     -0.1% ( -9% -  10%)
             OrHighLow           6.67   (4.6%)           6.66   (4.5%)     -0.1% ( -8% -   9%)
          OrNotHighMed           2.90   (5.2%)           2.89   (5.2%)     -0.1% (-10% -  10%)
         OrHighNotHigh           2.32   (4.9%)           2.32   (4.6%)     -0.0% ( -9% -   9%)
                Fuzzy1          20.35   (3.1%)          20.38   (3.4%)      0.1% ( -6% -   6%)
          OrHighNotLow          13.54   (4.5%)          13.56   (4.2%)      0.2% ( -8% -   9%)
             MedPhrase          11.75   (3.2%)          11.78   (2.4%)      0.2% ( -5% -   5%)
             LowPhrase           6.08   (2.9%)           6.09   (2.7%)      0.2% ( -5% -   6%)
            HighPhrase          13.25   (3.8%)          13.29   (3.4%)      0.3% ( -6% -   7%)
               Prefix3          19.78   (3.2%)          19.85   (3.9%)      0.4% ( -6% -   7%)
               Respell          15.13   (3.1%)          15.19   (3.7%)      0.4% ( -6% -   7%)
              Wildcard           8.82   (3.3%)           8.89   (4.9%)      0.8% ( -7% -   9%)
                IntNRQ           0.85   (4.2%)           0.86   (6.0%)      1.3% ( -8% -  12%)
            AndHighLow         172.85   (4.9%)         175.57   (4.7%)      1.6% ( -7% -  11%)
{code}


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-16 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396-simple.patch

I have rebased the patch onto the recent git mirror commit 
9069570eba29b3270bf5232f4fc8f6a156ff66d1.

I've also optimized BooleanScorerCollector so that the coord is calculated 
in the constructor.
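
To illustrate the kind of change described above, here is a minimal sketch, assuming the coord factor depends only on the number of matching clauses (overlap) and the maximum possible number (maxOverlap), as in overlap / maxOverlap. CoordTable is a hypothetical class written for this example; it is not the BooleanScorerCollector from the patch.

```java
public class CoordTable {
    private final float[] coordFactors;

    /** Precomputes the coord factor for every possible overlap once, up front. */
    CoordTable(int maxOverlap) {
        coordFactors = new float[maxOverlap + 1];
        for (int overlap = 0; overlap <= maxOverlap; overlap++) {
            // Assumed formula: overlap / maxOverlap, with 1 when there are no clauses.
            coordFactors[overlap] = maxOverlap == 0 ? 1f : overlap / (float) maxOverlap;
        }
    }

    /** Per-document work is reduced to an array lookup and a multiply. */
    float score(float rawScore, int overlap) {
        return rawScore * coordFactors[overlap];
    }

    public static void main(String[] args) {
        CoordTable table = new CoordTable(4);
        System.out.println(table.score(2.0f, 3)); // 2.0 * 3/4, prints 1.5
    }
}
```

The point of the design is that the division happens maxOverlap + 1 times in total, rather than once per collected document.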


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-15 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: merge-simple.png
LUCENE-4396-simple.patch
merge-simple.perf

This is a patch based on git mirror commit 
67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I have simplified the logic for choosing scorers.
I think it is quite simple and intuitive now.

[^merge-simple.perf] contains the raw performance data.
You can also refer to the following figure.
!merge-simple.png!


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-15 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098565#comment-14098565
]

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions, Mike!
{quote}
I'm worried about how BooleanWeight.bulkScorer first pulls BulkScorer
for the clauses, and then sometimes also pulls Scorer; pulling a
Scorer is not that cheap an operation in general.
{quote}
My current plan is to break out of the first iteration over the weights as soon as it
reaches a required scorer.
This way, scorers are pulled exactly as many times as on trunk.

{quote}
Maybe if we added .cost() to bulk scorer we could avoid that?
{quote}
I don't think so. When the logic chooses DAAT rather than BS,
it has to delegate to super.bulkScorer(), which pulls all the scorers again.

{quote}
Or maybe we could look at the BulkScorer, and if it's a DefaultBulkScorer,
just ask it for the Scorer it wrapped?
{quote}
That could get awkward when it's not a DefaultBulkScorer, but I'm not sure.
I'll give it a try.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-15 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396-simple.patch

In this patch, I have made BS's classes static and adjusted the scorer-choosing 
logic so that scorers are pulled exactly as many times as on trunk.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-14 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096845#comment-14096845
 ] 

Da Huang commented on LUCENE-4396:
--

{quote}
it looks like we are down to one new added scorer
{quote}
Yes, we have only one added scorer now.

{quote}
 I wonder if we can somehow simplify that decision
process?
{quote}
Yes, I agree. The current choosing logic is indeed too tricky.
I'm going to find a simpler and more intuitive way.
I think the perf. figures shown in perf.png are still the most important 
reference.



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-13 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 
67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I added test cases to make sure the scores calculated by BS, BAS and 
DAAT are the same.

I have also deleted the unused logic and added comments.
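
The kind of check such a test performs can be sketched on toy data. The example below is only an illustration under stated assumptions: it scores a pure disjunction over small int arrays, once term-at-a-time (TAAT) and once document-at-a-time (DAAT), and verifies that both give identical scores. ScoreEquivalenceSketch and its methods are hypothetical names, not the test cases from the patch. Both variants add the per-term weights in the same clause order, which is what keeps the float sums bit-for-bit equal.

```java
import java.util.Arrays;

public class ScoreEquivalenceSketch {
    /** Term-at-a-time: process one posting list fully, accumulating into a per-doc array. */
    static float[] scoreTaat(int maxDoc, int[][] postings, float[] weights) {
        float[] scores = new float[maxDoc];
        for (int t = 0; t < postings.length; t++) {
            for (int doc : postings[t]) {
                scores[doc] += weights[t];
            }
        }
        return scores;
    }

    /** Document-at-a-time: for each doc, visit every posting list through its own cursor. */
    static float[] scoreDaat(int maxDoc, int[][] postings, float[] weights) {
        float[] scores = new float[maxDoc];
        int[] cursor = new int[postings.length];
        for (int doc = 0; doc < maxDoc; doc++) {
            float sum = 0f;
            for (int t = 0; t < postings.length; t++) {
                if (cursor[t] < postings[t].length && postings[t][cursor[t]] == doc) {
                    sum += weights[t];
                    cursor[t]++;
                }
            }
            scores[doc] = sum;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up postings (sorted doc IDs per term) and per-term weights.
        int[][] postings = { {0, 2, 5}, {2, 3, 5}, {1, 5} };
        float[] weights = { 0.1f, 0.7f, 1.3f };
        float[] taat = scoreTaat(6, postings, weights);
        float[] daat = scoreDaat(6, postings, weights);
        if (!Arrays.equals(taat, daat)) {
            throw new AssertionError("TAAT and DAAT scores differ");
        }
        System.out.println("scores match: " + Arrays.toString(daat));
    }
}
```

If one strategy summed the clauses in a different order, the float results could differ in the last bits, which is the sort of mismatch an equality test like this is meant to catch.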


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-11 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092678#comment-14092678
 ] 

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions, Mike!

{quote}
Is it possible to make a test case showing what the bug was, and
that's fixed (and stays fixed)?
{quote}
The current test cases can show the bug, if you uncomment this line:
{code}
//  scorerOrClass = BooleanArrayScorer.class;
{code}

{quote}
Also, do we have a test case that fails if DAAT and TAAT scoring
differs (as it does on trunk today)? 
{quote}
Not yet. I'll add that test case in the next patch.

{quote}
Can you add a comment to that
part in the code, linking to this issue and explaining the motivation
behind it?
{quote}
Sure.

{quote}
Can I commit TestBooleanUnevenly to trunk today? Seems like there's
no reason to wait...
{quote}
Yes, sure.



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-10 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 
d707f783ab068b70752a3f9cfdc0dabb7f4fbadf .

In this patch, I tried to fix the .getChildren() problem in BAS and BLS.

I tried to make .bulkScorer() choose DAAT when scoreDocsInOrder is true.
However, I discovered that I would have to copy the scorer-choosing logic into 
.scoreDocsOutOfOrder() 
to make things right.

I also tried to implement the .getChildren() method for BAS and BLS,
but the TAAT strategy exhausts the scorers at the very beginning.

In the end, I just throw UnsupportedOperationException in BAS.getChildren() and 
BLS.getChildren().


I have also run more tests to make sure everything is right.
As you can see, the performance of the HighAnd.\*Low.\* cases shown in merge.png 
is not good.
Therefore, I ran the HighAnd.\*Low.\* cases with luceneutil's pattern filter, and 
the results are as follows.
{code}
                  Task   QPS baseline   StdDev QPS my_version   StdDev  Pct diff
         HighAnd6LowOr           9.44   (6.4%)           9.19   (4.8%)     -2.6% (-12% -   9%)
         HighAnd5LowOr           9.00   (8.8%)           8.85   (7.4%)     -1.6% (-16% -  16%)
         HighAnd3LowOr          11.89   (8.9%)          11.71   (7.8%)     -1.6% (-16% -  16%)
         HighAnd4LowOr          10.78   (7.4%)          10.61   (6.3%)     -1.5% (-14% -  13%)
         HighAnd7LowOr           9.08   (7.2%)           8.94   (5.8%)     -1.5% (-13% -  12%)
         HighAnd8LowOr           6.32   (8.6%)           6.23   (6.9%)     -1.4% (-15% -  15%)
         HighAnd9LowOr           5.71   (5.7%)           5.65   (4.5%)     -1.1% (-10% -   9%)
              PKLookup          98.95   (4.5%)          98.38   (2.4%)     -0.6% ( -7% -   6%)
        HighAnd9LowNot           7.49   (3.7%)           7.46   (3.2%)     -0.4% ( -7% -   6%)
        HighAnd4LowNot          10.33   (6.4%)          10.31   (6.1%)     -0.2% (-11% -  13%)
        HighAnd8LowNot           6.69   (5.3%)           6.70   (4.9%)      0.1% ( -9% -  10%)
        HighAnd7LowNot           6.82   (5.1%)           6.84   (5.0%)      0.3% ( -9% -  10%)
        HighAnd6LowNot           9.45   (5.5%)           9.48   (4.7%)      0.3% ( -9% -  11%)
        HighAnd3LowNot          10.80   (6.7%)          10.87   (6.1%)      0.6% (-11% -  14%)
        HighAnd5LowNot           4.28   (7.4%)           4.32   (7.1%)      1.0% (-12% -  16%)
{code}
Everything looks right.

I have also run tests for more complicated tasks.
{code}
                  Task   QPS baseline   StdDev QPS my_version   StdDev  Pct diff
   LowAnd6LowOr6LowNot          31.59   (1.0%)          28.52   (2.4%)     -9.7% (-12% -  -6%)
  HighAnd6LowOr6LowNot           6.10   (2.7%)           5.76   (4.0%)     -5.6% (-11% -   1%)
   MedAnd6LowOr6LowNot           7.33   (2.3%)           7.03   (3.1%)     -4.0% ( -9% -   1%)
  HighAnd6MedOr6LowNot           3.51   (1.5%)           3.49   (2.6%)     -0.6% ( -4% -   3%)
              PKLookup          95.99   (5.1%)          95.48   (4.9%)     -0.5% (-10% -   9%)
  HighAnd6MedOr6MedNot           1.96   (1.3%)           1.97   (2.5%)      0.4% ( -3% -   4%)
   MedAnd6MedOr6MedNot           2.34   (1.2%)           2.35   (2.3%)      0.5% ( -2% -   4%)
 HighAnd6LowOr6HighNot           1.31   (1.1%)           1.33   (2.4%)      0.9% ( -2% -   4%)
  HighAnd6LowOr6MedNot           3.08   (1.5%)           3.12   (2.7%)      1.2% ( -2% -   5%)
   MedAnd6LowOr6MedNot           3.72   (1.4%)           3.89   (2.6%)      4.8% (  0% -   8%)
 HighAnd6MedOr6HighNot           1.40   (1.0%)           1.53   (2.4%)      9.3% (  5% -  12%)
   LowAnd6LowOr6MedNot           9.23   (2.1%)          10.19   (2.7%)     10.4% (  5% -  15%)
  LowAnd6LowOr6HighNot           6.04   (2.5%)           6.74   (2.9%)     11.6% (  6% -  17%)
 LowAnd6HighOr6HighNot           4.15   (3.4%)           4.72   (4.2%)     13.8% (  5% -  22%)
  MedAnd6MedOr6HighNot           1.65   (1.2%)           1.91   (2.2%)     15.7% ( 12% -  19%)
  MedAnd6LowOr6HighNot           2.42   (1.7%)           2.80   (2.7%)     16.0% ( 11% -  20%)
  LowAnd6HighOr6LowNot           4.69   (2.9%)           5.45   (3.7%)     16.1% (  9% -  23%)
   MedAnd6MedOr6LowNot           3.45   (1.2%)           4.04   (2.1%)     17.1% ( 13% -  20%)
   LowAnd6MedOr6LowNot           8.77   (1.6%)          10.38   (2.4%)     18.4% ( 14% -  22%)
   LowAnd6MedOr6MedNot           6.36   (2.6%)           7.55   (3.5%)     18.6% ( 12% -  25%)
  LowAnd6MedOr6HighNot           5.48   (3.1%)           6.51   (3.9%)     18.8% ( 11% -  26%)
  LowAnd6HighOr6MedNot           5.77   (3.1%)           6.86   (4.3%)      18.9

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-08 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: perf.png


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-08 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: merge.png
merge.perf
LUCENE-4396.patch

This is a patch based on git mirror commit 
a5a2e716ebcba1a201c4934f336ae9c0fcb551bf .

In this patch, I have fixed a bug that caused wrong coord counting.

Besides, I have come up with an idea for choosing the scorer and implemented 
it in this patch.

Here is how the idea came about.

After completing the performance tests for each scorer, I plotted the results 
to get an intuitive view of how each scorer behaves.
The figures are shown below.
!perf.png!

Then I noticed that the performance of each scorer can probably be fitted 
by a straight line.
A few points look like outliers, such as (8, -20) on BAS in the LowAndNLowOr 
case.
However, when I retested BAS, its performance came out at 10 with N = 8.
Therefore, I treat these outliers as noise.

The next step is to find expressions for those performance curves.
Looking at the perf. figures again, BAS gets the best performance on average, 
so I only discuss BAS here.

The fitted line for each performance curve is shown below 
(the horizontal axis is 'x' and the vertical axis is 'y').
|| LowAndNLowOr  || LowAndNHighOr  || HighAndNLowOr  || HighAndNHighOr  ||
| y = 5.33x - 31 | y = 3.83x - 8.5 | y = 1.67x - 55  | y = 7.5x - 32.5  |
|| LowAndNLowNot || LowAndNHighNot || HighAndNLowNot || HighAndNHighNot ||
| y = 4.5x - 18.5 | y = 3x - 7   | y = -0.83x - 22.5 | y = 7x - 31  |

I got these expressions by visual estimation.
You can get similar expressions by drawing a straight line between the 
first and last point in each figure.
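
As a rough illustration of the first-to-last-point shortcut, the slope and intercept of such a line can be computed as in the sketch below. The class name, method name, and the two sample points are assumptions made up for this example; they are not read from perf.png.

```java
/** Line through two points; a rough stand-in for the visual fit described above. */
final class TwoPointFit {
  /** Returns {slope, intercept} of the line through (x1, y1) and (x2, y2). */
  static double[] fit(double x1, double y1, double x2, double y2) {
    if (x1 == x2) {
      throw new IllegalArgumentException("x1 and x2 must differ");
    }
    double slope = (y2 - y1) / (x2 - x1);
    double intercept = y1 - slope * x1;
    return new double[] {slope, intercept};
  }

  public static void main(String[] args) {
    // Hypothetical end points, not taken from the attached figures.
    double[] line = fit(2.0, 1.0, 6.0, 9.0);
    System.out.printf("y = %.2fx + %.2f%n", line[0], line[1]);
  }
}
```

A two-point line ignores everything in between, so it is only as good as the two end points; a least-squares fit would be less sensitive to a single noisy measurement.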

Now suppose that the general performance expression is y = A \* x + B .
In lucene/BooleanQuery, the only information we have is requiredCost and 
optionalCost (or prohibitedCost).
For convenience, let's denote these two values as 'a' and 'b' 
respectively.

If we can find two functions f and g such that A = f(a, b) and B = g(a, b), 
we can compute the performance curve in the program.

For brevity, I only discuss A = f(a, b) here; B is handled in a similar way.
Likewise, I only discuss the \*Or cases; the \*Not cases are similar.

Here, we know the values of a and b for each case.
|| LowAndNLowOr  || LowAndNHighOr  || HighAndNLowOr  || HighAndNHighOr  ||
| a = L, b = L   | a = L, b = H| a = H, b = L| a = H, b = H |

Here, L represents a low cost and H represents a high cost.
We can estimate these two values by collecting statistics on 
wikimedium.10M.nostopwords.tasks in luceneutil.
The estimated values are:
{code}
H = 747310, L = 34750
{code}

As you can see, the values of H and L are very large,
so we take their natural logarithms instead:
{code}
h = log(H), l = log(L)
{code}

Suppose that f has the form 
{code}
f(a,b) = k1 * u1(a, b) + k2 * u2(a, b) + k3 * u3(a, b) + k4 * u4(a, b)
{code}

Thus, we have
{code}
k1 * u1(l, l) + k2 * u2(l, l) + k3 * u3(l, l) + k4 * u4(l, l) = 5.33
k1 * u1(l, h) + k2 * u2(l, h) + k3 * u3(l, h) + k4 * u4(l, h) = 3.84
k1 * u1(h, l) + k2 * u2(h, l) + k3 * u3(h, l) + k4 * u4(h, l) = 1.67
k1 * u1(h, h) + k2 * u2(h, h) + k3 * u3(h, h) + k4 * u4(h, h) = 7.5
{code}

The next question is how to choose ui(a, b).
I tried many formulations and found the following to work best.

||u1(a,b)||u2(a,b)||u3(a,b)||u4(a,b)||
|a   |b   |  a\*b | a\*b/(a+b)|

I think this setup has an intuitive meaning.
u1 and u2 capture the individual influence of a and b respectively.
u3 is the higher-order interaction term.
u4 is half of the harmonic mean of a and b.

Thus, we have
{code}
[ l  l  l*l  l/2       ]   [ k1 ]   [5.33]
[ l  h  l*h  l*h/(l+h) ] * [ k2 ] = [3.84]
[ h  l  h*l  l*h/(l+h) ]   [ k3 ]   [1.67]
[ h  h  h*h  h/2       ]   [ k4 ]   [7.5 ]

that is
[  10.4559   10.4559  109.3266  5.2280 ]
[  10.4559   13.5242  141.4085  5.8969 ] * [k] = [A]
[  13.5242   10.4559  141.4085  5.8969 ]
[  13.5242   13.5242  182.9049  6.7621 ]

or symbolized
[U] * [k] = [A]
{code}

Luckily, \[U\] is invertible, so we can solve for \[k\] directly.
{code}
           [ -1.4365    1.1106    1.4365   -1.1106 ]
inv([U]) = [ -1.4365    1.4365    1.1106   -1.1106 ]
           [ -0.0312    0.0000    0.0000    0.0241 ]
           [  6.5893   -5.0943   -5.0943    3.9386 ]

[k] = inv([U]) * [A] = [-9.3396  -8.6334  0.0145  36.6630]'
{code}
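
For anyone who wants to reproduce \[k\], here is a minimal Java sketch that builds \[U\] from l and h and solves \[U\] \* \[k\] = \[A\]. Note that it uses Gaussian elimination with partial pivoting rather than the explicit inverse shown above. The class and method names are assumptions for this example only, and the result may differ from the values above in the last digits because the matrices there are rounded.

```java
/** Solves [U] * [k] = [A] for the four slope coefficients (illustrative sketch). */
final class SlopeCoefficients {
  /** Feature vector [a, b, a*b, a*b/(a+b)], i.e. u1..u4 from the table above. */
  static double[] features(double a, double b) {
    return new double[] {a, b, a * b, a * b / (a + b)};
  }

  /** Solves m * x = rhs by Gaussian elimination with partial pivoting. */
  static double[] solve(double[][] m, double[] rhs) {
    int n = rhs.length;
    double[][] aug = new double[n][n + 1];
    for (int i = 0; i < n; i++) {
      System.arraycopy(m[i], 0, aug[i], 0, n);
      aug[i][n] = rhs[i];
    }
    for (int col = 0; col < n; col++) {
      // Pick the row with the largest entry in this column as the pivot.
      int pivot = col;
      for (int row = col + 1; row < n; row++) {
        if (Math.abs(aug[row][col]) > Math.abs(aug[pivot][col])) {
          pivot = row;
        }
      }
      double[] tmp = aug[col];
      aug[col] = aug[pivot];
      aug[pivot] = tmp;
      if (aug[col][col] == 0.0) {
        throw new ArithmeticException("singular matrix");
      }
      // Eliminate this column from all rows below the pivot.
      for (int row = col + 1; row < n; row++) {
        double factor = aug[row][col] / aug[col][col];
        for (int j = col; j <= n; j++) {
          aug[row][j] -= factor * aug[col][j];
        }
      }
    }
    // Back substitution.
    double[] x = new double[n];
    for (int i = n - 1; i >= 0; i--) {
      double sum = aug[i][n];
      for (int j = i + 1; j < n; j++) {
        sum -= aug[i][j] * x[j];
      }
      x[i] = sum / aug[i][i];
    }
    return x;
  }

  /** Builds [U] from the four (a, b) cases (l,l), (l,h), (h,l), (h,h) and solves for [k]. */
  static double[] fitK(double l, double h, double[] slopes) {
    double[][] u = {features(l, l), features(l, h), features(h, l), features(h, h)};
    return solve(u, slopes);
  }

  public static void main(String[] args) {
    double l = Math.log(34750);   // natural log of L
    double h = Math.log(747310);  // natural log of H
    double[] k = fitK(l, h, new double[] {5.33, 3.84, 1.67, 7.5});
    System.out.println(java.util.Arrays.toString(k));
  }
}
```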

Now, in the program, we can get A by,
{code}
A = f(a,b) = k1 * a + k2 * b + k3 * a * b + k4 * a * b / (a + b)
{code}
and get B in a similar way.
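
The formula for A can be evaluated as in the following sketch, which plugs in the \[k\] values reported above and takes the natural log of the raw costs first. The class and method names are assumptions for illustration, and this is not the code in the attached patch.

```java
/** Evaluates A = f(a, b) with the reported coefficients (illustrative sketch). */
final class SlopeModel {
  // Coefficients [k1, k2, k3, k4] as reported in the comment above.
  static final double[] K = {-9.3396, -8.6334, 0.0145, 36.6630};

  /** Predicted slope A for raw (not yet log-scaled) required and optional costs. */
  static double slope(long requiredCost, long optionalCost) {
    if (requiredCost < 2 || optionalCost < 2) {
      // Keeps log(cost) positive so that a + b cannot be zero.
      throw new IllegalArgumentException("costs must be at least 2");
    }
    double a = Math.log(requiredCost);
    double b = Math.log(optionalCost);
    return K[0] * a + K[1] * b + K[2] * a * b + K[3] * a * b / (a + b);
  }

  public static void main(String[] args) {
    // L = 34750 and H = 747310 are the estimated costs quoted above.
    System.out.printf("A(L, H) = %.3f%n", slope(34750L, 747310L));
  }
}
```

Feeding the four (L, H) combinations back in should give values close to the fitted slopes 5.33, 3.84, 1.67 and 7.5, up to the rounding of \[k\].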

Finally, we get the estimated fitted line of BAS for a specific case.
y

[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-05 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086295#comment-14086295
 ] 

Da Huang commented on LUCENE-4396:
--

Thanks to Mike and Paul for the suggestions.

{quote}
It's a bit spooky that collectMore recurses on itself; in theory
there's an adversary that could consume quite a bit of stack right?
Can we refactor that to the equivalent while loop (it's just tail
recursion).
{quote}
Ok. Doing collectMore without recursion is easy.
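
The general shape of that refactoring is sketched below. This is not collectMore itself: the method names and the doc/target/step parameters are hypothetical and only show how a tail call becomes a while loop.

```java
/** Tail recursion rewritten as a loop (hypothetical stand-in, not collectMore). */
final class TailRecursionDemo {
  /** Tail-recursive form: stack depth grows with the number of steps taken. */
  static int advanceRecursive(int doc, int target, int step) {
    if (step <= 0) {
      throw new IllegalArgumentException("step must be positive");
    }
    if (doc >= target) {
      return doc;
    }
    return advanceRecursive(doc + step, target, step); // tail call
  }

  /** Equivalent loop: same result, constant stack depth. */
  static int advanceLoop(int doc, int target, int step) {
    if (step <= 0) {
      throw new IllegalArgumentException("step must be positive");
    }
    while (doc < target) {
      doc += step;
    }
    return doc;
  }

  public static void main(String[] args) {
    // The loop handles millions of steps; the recursive form could overflow the stack here.
    System.out.println(advanceLoop(0, 5_000_000, 1));
  }
}
```

Java does not eliminate tail calls, so the loop form is what removes the stack-depth concern raised in the quote.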

{quote}
Unfortunately the logic for picking which scorer to use looks really
complex; hopefully we can simplify it.
Also, do we really need 3 scorer classes (BS, BAS, BLS) for the
non-DAAT case? Ie, does each really provide a compelling situation
where it's better than the others? 
{quote}
Actually, these scorers are still very competitive when there are far fewer 
clauses.
I ran some tests today. The results are as follows.

{code}
Task                 array       bs       ll
HighAnd10HighNot      34.7     31.7     53.7*
HighAnd10HighOr       27.0+    -0.9     32.6*
HighAnd10LowNot      -33.5     -6.5    -17.5
HighAnd10LowOr       -36.7     -2.1    -43.4
HighAnd5HighNot        3.0    -10.0     16.8*
HighAnd5HighOr       -11.5     -2.0     -4.2
HighAnd5LowNot       -44.5     -9.4    -36.6
HighAnd5LowOr        -56.2     -3.4    -61.9
LowAnd10HighNot       18.2+    18.7+    20.0*
LowAnd10HighOr        21.2+    26.1*     6.0
LowAnd10LowNot        19.6*    -2.4      5.7
LowAnd10LowOr         13.3*    -2.7    -11.1
LowAnd5HighNot         9.1*     6.9+    -2.7
LowAnd5HighOr          7.6     12.5*    -9.3
LowAnd5LowNot         -0.9     -4.0    -11.0
LowAnd5LowOr          -7.5     -3.6    -27.2

Task Good Method
HighAnd10HighNot   ll, 
 HighAnd10HighOr   ll, array, 
 HighAnd10LowNot   
  HighAnd10LowOr   
 HighAnd5HighNot   ll, 
  HighAnd5HighOr   
  HighAnd5LowNot   
   HighAnd5LowOr   
 LowAnd10HighNot   ll, bs, array, 
  LowAnd10HighOr   bs, array, 
  LowAnd10LowNot   array, 
   LowAnd10LowOr   array, 
  LowAnd5HighNot   array, bs, 
   LowAnd5HighOr   bs, 
   LowAnd5LowNot   
LowAnd5LowOr   

Task                 array       bs       ll
HighAnd25HighNot      75.3     79.1    131.5*
HighAnd25HighOr       69.8+    74.2+    80.8*
HighAnd25LowNot       -1.0      3.8     15.7*
HighAnd25LowOr        -3.8    -34.7     -9.1
HighAnd5HighNot        8.1*    -2.9      7.8+
HighAnd5HighOr        -1.6     -4.1    -12.9
HighAnd5LowNot       -37.3    -33.3    -39.1
HighAnd5LowOr        -60.8    -42.5    -60.7
LowAnd25HighNot       38.9     40.1     79.4*
LowAnd25HighOr        44.8*    40.2+    23.5
LowAnd25LowNot        52.7+    55.7*    39.2+
LowAnd25LowOr         51.1*    50.8+    23.7
LowAnd5HighNot        10.0+    12.0*    -2.7
LowAnd5HighOr          5.0      8.0*    -9.9
LowAnd5LowNot          2.6      4.1*   -10.1
LowAnd5LowOr          -8.8     -5.1    -29.1

Task Good Method
HighAnd25HighNot   ll, 
 HighAnd25HighOr   ll, bs, array, 
 HighAnd25LowNot   ll, 
  HighAnd25LowOr   
 HighAnd5HighNot   array, ll, 
  HighAnd5HighOr   
  HighAnd5LowNot   
   HighAnd5LowOr   
 LowAnd25HighNot   ll, 
  LowAnd25HighOr   array, bs, 
  LowAnd25LowNot   bs, array, ll, 
   LowAnd25LowOr   array, bs, 
  LowAnd5HighNot   bs, array, 
   LowAnd5HighOr   bs, 
   LowAnd5LowNot   bs, 
LowAnd5LowOr   
{code}


For now, I'm only using BAS and BLS for cases with MUST, as BS's performance is 
not very competitive.
Even though BS looks like a compelling choice for the LowAnd5HighOr case, its 
advantage over BAS is not large.
Besides, BS would make the logic even more complicated, as BS is a BulkScorer 
while the others are Scorers.

If we still need to give up one scorer, I think it would be better to give up 
BLS,
as BAS seems to provide more benefit than BLS.


{quote}
It's not great adding so much
complexity for performance gains of unusual (so many clauses) boolean
queries...
{quote}
I'm going to focus on just the \*And5\* and \*And10\* cases when optimizing the 
perf.
If 10 clauses are still too many, I will focus on only the \*And5\* cases.


Besides, today

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-03 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: tasks.cpp
LUCENE-4396.patch
And.tasks

This patch is based on git mirror commit 
67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I removed the unused classes, encapsulated some functions, and 
fixed some bugs.

Besides, the cases in the tasks file used before were heavily correlated with 
each other, which I don't think is good. Therefore, I generated a new tasks 
file.

The file And.tasks is the new tasks file, while 'tasks.cpp' is the program that 
generates it.
You can generate the tasks file by running
{code}
g++ tasks.cpp -std=c++0x -o tasks
./tasks  wikimedium.10M.nostopwords.tasks  And.tasks
{code}

The perf. on the new tasks file is as follows.
{code}
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
HighAnd5LowNot                5.40    (5.1%)            4.88    (4.2%)     -9.6% ( -18% -    0%)
HighAnd5LowOr                 7.05   (10.2%)            6.87    (3.8%)     -2.6% ( -15% -   12%)
LowAnd5LowNot                27.17    (2.1%)           26.47    (2.6%)     -2.6% (  -7% -    2%)
HighAnd5HighOr                1.13    (3.8%)            1.11    (2.2%)     -1.8% (  -7% -    4%)
LowAnd5LowOr                 31.82    (2.6%)           31.35    (2.3%)     -1.5% (  -6% -    3%)
PKLookup                     98.80    (5.2%)          102.02    (6.3%)      3.3% (  -7% -   15%)
HighAnd5HighNot               1.95    (1.0%)            2.04    (2.1%)      4.7% (   1% -    7%)
LowAnd5HighNot                9.46    (2.9%)           10.32    (2.7%)      9.0% (   3% -   15%)
LowAnd5HighOr                 7.56    (2.8%)            8.42    (2.8%)     11.4% (   5% -   17%)
LowAnd60HighOr                0.51    (2.5%)            0.82    (4.8%)     58.7% (  50% -   67%)
LowAnd60LowNot                2.61    (1.0%)            4.64    (3.4%)     78.0% (  72% -   83%)
HighAnd60LowNot               1.30    (1.2%)            2.36    (3.7%)     81.1% (  75% -   87%)
HighAnd60LowOr                1.18    (1.3%)            2.15    (3.7%)     82.0% (  76% -   88%)
LowAnd60LowOr                 2.25    (0.6%)            4.61    (4.2%)    104.7% (  99% -  110%)
HighAnd60HighOr               0.10    (0.7%)            0.26    (4.8%)    151.2% ( 144% -  157%)
LowAnd60HighNot               0.53    (2.5%)            1.62    (8.0%)    204.0% ( 188% -  220%)
HighAnd60HighNot              0.14    (0.9%)            0.59    (8.9%)    328.4% ( 315% -  341%)
{code}

My next step is to run more tests to get better rules and verify 
correctness. I think this can be finished by this Friday.

As the suggested pencils-down date is coming, I will begin to scrub the code, 
improve the comments, and write the final documentation.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-03 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084332#comment-14084332
]

Da Huang commented on LUCENE-4396:
--

Hi, [~paul.elsc...@xs4all.nl].

The commit hash mentioned here just indicates which commit the patch
should be applied to.

For example, if you want to get the latest Java code discussed here, you can run
the following:
{code}
git clone https://github.com/apache/lucene-solr
cd lucene-solr
git checkout 67d17eb81b754fa242bb91e1b91070fd8b38ecd9
git apply LUCENE-4396.patch
{code}

LUCENE-4396.patch is attached to this page; you need to download it first.

Hope this helps.

By the way, there is a repo where I'm maintaining the code, but it is on the
server in my lab.
You won't be able to clone it without a password.
Sorry about that.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-08-02 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 
67d17eb81b754fa242bb91e1b91070fd8b38ecd9

In this patch, I build further on the last patch.

Firstly, I moved all the scorer-choosing logic to .bulkScorer(), so that there's 
no need to wrap the scorer in .bulkScorer().

Secondly, I tried to use BooleanScorer for some cases with MUST.
However, it seems that there was something wrong with my earlier test on BS.
BS only beats DAAT in 2 cases, and in those 2 cases it still performs worse than 
the other explored scorers.

The perf. of BQ (the merged scorer) and BS is shown below.

{code}
BQ
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
HighAndTonsLowNot             5.01    (3.5%)            4.29    (2.7%)    -14.3% ( -19% -   -8%)
HighAndSomeLowNot            15.33    (5.1%)           13.71    (5.4%)    -10.6% ( -20% -    0%)
LowAndSomeLowOr             240.72    (2.5%)          217.73    (2.5%)     -9.6% ( -14% -   -4%)
LowAndSomeLowNot            269.51    (1.4%)          244.76    (2.3%)     -9.2% ( -12% -   -5%)
HighAndTonsLowOr              5.19    (5.3%)            4.94    (2.0%)     -4.8% ( -11% -    2%)
HighAndSomeHighNot            1.60    (2.0%)            1.57    (2.6%)     -1.9% (  -6% -    2%)
HighAndSomeLowOr              6.65   (11.5%)            6.77    (4.1%)      1.8% ( -12% -   19%)
PKLookup                     96.93    (2.3%)           99.72    (4.1%)      2.9% (  -3% -    9%)
LowAndSomeHighNot            59.45    (1.5%)           61.63    (2.4%)      3.7% (   0% -    7%)
LowAndSomeHighOr             40.78    (2.0%)           42.75    (3.0%)      4.8% (   0% -   10%)
HighAndSomeHighOr             2.11    (2.8%)            2.44    (3.0%)     16.1% (  10% -   22%)
LowAndTonsLowNot             17.45    (1.3%)           20.88    (2.5%)     19.6% (  15% -   23%)
LowAndTonsHighOr              2.76    (1.6%)            3.34    (3.1%)     21.0% (  16% -   26%)
LowAndTonsLowOr              15.36    (1.2%)           19.83    (3.1%)     29.2% (  24% -   33%)
HighAndTonsHighOr             0.08    (0.7%)            0.21    (5.1%)    159.8% ( 152% -  166%)
LowAndTonsHighNot             1.69    (1.5%)            5.14    (5.9%)    204.0% ( 193% -  214%)
HighAndTonsHighNot            0.09    (0.7%)            0.41   (11.0%)    359.9% ( 345% -  374%)


BooleanScorer
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
LowAndSomeHighOr             51.38    (1.7%)            1.47    (0.4%)    -97.1% ( -97% -  -96%)
LowAndTonsHighOr              2.79    (1.5%)            0.10    (0.5%)    -96.5% ( -97% -  -95%)
LowAndTonsHighNot             1.71    (2.0%)            0.17    (0.7%)    -90.3% ( -91% -  -89%)
LowAndSomeHighNot            32.69    (2.2%)            3.18    (0.6%)    -90.3% ( -91% -  -89%)
LowAndSomeLowOr             258.50    (1.7%)           91.84    (1.6%)    -64.5% ( -66% -  -62%)
HighAndSomeLowOr             12.66    (9.1%)            5.89    (2.3%)    -53.5% ( -59% -  -46%)
LowAndSomeLowNot            252.33    (2.1%)          124.57    (1.1%)    -50.6% ( -52% -  -48%)
HighAndTonsLowOr              3.13    (7.5%)            1.57    (2.3%)    -49.7% ( -55% -  -43%)
LowAndTonsLowOr              14.17    (0.8%)            7.32    (2.6%)    -48.4% ( -51% -  -45%)
HighAndSomeLowNot            18.01    (5.6%)           10.03    (2.8%)    -44.3% ( -49% -  -37%)
LowAndTonsLowNot             17.17    (1.1%)           11.33    (1.5%)    -34.0% ( -36% -  -31%)
HighAndTonsLowNot             6.29    (2.5%)            4.73    (2.4%)    -24.9% ( -29% -  -20%)
HighAndSomeHighOr             1.66    (3.1%)            1.28    (7.5%)    -22.7% ( -32% -  -12%)
HighAndSomeHighNot            2.11    (1.4%)            1.83    (3.4%)    -13.5% ( -18% -   -8%)
PKLookup                     96.92    (4.0%)           94.94    (2.5%)     -2.0% (  -8% -    4%)
HighAndTonsHighOr             0.07    (0.5%)            0.09   (18.2%)     38.3% (  19% -   57%)
HighAndTonsHighNot            0.04    (1.9%)            0.16   (24.4%)    263.0% ( 232% -  294%)

{code}

According to the perf. table of BQ, BQ performs poorly on the first 4 cases.
However, when I run these cases one by one, they are at most 2% worse than 
trunk.
I'm not sure what causes this.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-28 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 
ce7d0578b30981d15687bf76aec595274efccbad
I've tried to merge all the explored methods to get better performance for 
boolean retrieval.

In this patch, I only mix methods in BooleanQuery.BooleanWeight.scorer().
I tried to mix methods in .bulkScorer() as well, but that fails the ant test.

It took me a lot of time to figure out the cause.
It turned out that, in BooleanQuery.BooleanWeight.bulkScorer(), I'm not 
supposed to call w.bulkScorer() to get the optional scorer or the prohibited 
scorer; otherwise TestBooleanScorer.testEmbeddedBooleanScorer throws an 
UnsupportedOperationException 
because it calls an unimplemented .scorer() method.

The awkward part is that I can't get the cost of a scorer 
without an instance of Scorer.

Therefore, my next step is to check whether I can get the optional scorer in 
.bulkScorer().
If yes, I will do the same thing as in .scorer(). If not, I will just call 
BooleanScorer().

Besides, I'm sorry that the code in this patch may look ugly, 
as I haven't had enough time to tidy it up.

{code}
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
HighAndTonsLowNot             4.06    (4.0%)            3.44    (5.1%)    -15.5% ( -23% -   -6%)
HighAndSomeLowNot            17.02    (5.3%)           15.61    (9.2%)     -8.3% ( -21% -    6%)
HighAndTonsLowOr              5.82    (5.0%)            5.67    (1.5%)     -2.6% (  -8% -    4%)
LowAndSomeHighOr             55.03    (3.0%)           54.39    (2.2%)     -1.2% (  -6% -    4%)
HighAndSomeHighNot            1.24    (2.3%)            1.23    (2.3%)     -1.0% (  -5% -    3%)
LowAndSomeLowOr             231.48    (1.8%)          229.47    (2.1%)     -0.9% (  -4% -    3%)
PKLookup                     97.60    (2.1%)           97.63    (2.2%)      0.0% (  -4% -    4%)
LowAndSomeLowNot            312.07    (2.0%)          312.28    (2.1%)      0.1% (  -3% -    4%)
HighAndSomeHighOr             1.69    (2.6%)            1.69    (1.2%)      0.4% (  -3% -    4%)
HighAndSomeLowOr             14.28   (11.7%)           14.81    (4.7%)      3.7% ( -11% -   22%)
LowAndSomeHighNot            34.74    (2.9%)           36.83    (2.6%)      6.0% (   0% -   11%)
LowAndTonsHighOr              2.34    (2.7%)            2.90    (3.2%)     24.3% (  17% -   30%)
LowAndTonsLowOr              18.88    (1.0%)           25.14    (3.0%)     33.2% (  28% -   37%)
LowAndTonsLowNot             15.78    (1.4%)           22.29    (2.0%)     41.2% (  37% -   45%)
HighAndTonsHighOr             0.06    (0.6%)            0.17    (5.8%)    179.9% ( 172% -  187%)
LowAndTonsHighNot             1.33    (2.4%)            4.29    (8.1%)    223.5% ( 207% -  239%)
HighAndTonsHighNot            0.06    (1.8%)            0.34   (17.3%)    495.0% ( 467% -  523%)
{code}   


[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-24 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072693#comment-14072693
 ] 

Da Huang edited comment on LUCENE-4396 at 7/24/14 8:53 AM:
---

This patch is based on the git mirror commit 
ce7d0578b30981d15687bf76aec595274efccbad .
This is a first attempt at merging the scorers, so that we can get better perf 
for boolean retrieval.

I created a new class named BooleanMixedScorerDecider to choose the best 
scorer.
The rules for choosing remain to be improved. I have been working on finding an 
elegant way to define them.
{code}
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
HighAndSomeLowNot            11.53    (7.3%)           10.75   (10.1%)     -6.8% ( -22% -   11%)
HighAndTonsLowNot             4.87    (4.0%)            4.64    (6.0%)     -4.9% ( -14% -    5%)
LowAndSomeLowOr             306.20    (2.2%)          299.06    (2.8%)     -2.3% (  -7% -    2%)
HighAndSomeLowOr             13.67    (9.4%)           13.38    (2.7%)     -2.1% ( -13% -   11%)
HighAndTonsLowOr              4.04    (6.4%)            3.96    (1.9%)     -1.9% (  -9% -    6%)
LowAndSomeLowNot            215.18    (1.9%)          211.14    (2.2%)     -1.9% (  -5% -    2%)
PKLookup                     96.26    (2.3%)           94.56    (2.8%)     -1.8% (  -6% -    3%)
HighAndTonsHighNot            0.06    (2.3%)            0.06    (2.6%)     -1.0% (  -5% -    4%)
HighAndTonsHighOr             0.06    (0.6%)            0.06    (1.3%)      0.9% (   0% -    2%)
HighAndSomeHighNot            1.59    (2.2%)            1.62    (2.9%)      1.7% (  -3% -    6%)
LowAndSomeHighNot            66.33    (2.1%)           68.77    (2.1%)      3.7% (   0% -    8%)
LowAndSomeHighOr             53.75    (1.6%)           56.86    (2.1%)      5.8% (   1% -    9%)
LowAndTonsLowNot             14.00    (1.7%)           14.84    (1.5%)      6.1% (   2% -    9%)
HighAndSomeHighOr             2.39    (2.2%)            2.68    (3.5%)     12.4% (   6% -   18%)
LowAndTonsLowOr              17.69    (0.9%)           21.64    (1.7%)     22.3% (  19% -   25%)
LowAndTonsHighOr              1.83    (1.3%)            2.33    (2.4%)     27.2% (  23% -   31%)
LowAndTonsHighNot             1.15    (1.5%)            1.51    (3.1%)     30.9% (  25% -   36%)
{code}



[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-24 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074083#comment-14074083
 ] 

Da Huang commented on LUCENE-4396:
--

{quote}
Do we really need a separate class to make the decision about which scorer to 
use? Seems like the added logic for when to use BNS is fairly small so we could 
just add it into BQ's scorer method instead?
{quote}
OK, I will move the decision logic back to BQ.

{quote}
For bulkScorer, should we ever return BooleanScorer even when there are 
required clauses? Or was that just commented out for temporary benchmarking so 
we'd wrap BNS? When there is a required clause, if BNS is never slower than BS, 
then instead of falling back to super.bulkScorer we could do the wrapping 
ourselves there? Just to make it clearer we are using BNS ... or maybe just put 
a comment saying so (replacing that TODO).
{quote}
BooleanScorer should be used for bulkScorer in some cases. Falling back to 
super.bulkScorer when there are required clauses is just a temporary strategy.
See the following tables.
{code}
Task                ArrayNotDel       BS   BitSet       ll     llbs    size5    size8    size9
HighAndSomeHighNot          0.7     15.3*     7.4      8.9      2.0      6.6     10.0      3.4
HighAndSomeHighOr          13.3     24.5*     7.8      9.1     10.9     17.3+    18.3+    21.3+
HighAndSomeLowNot         -45.1    -53.9    -55.0    -57.3    -45.5    -47.8    -42.2    -41.5
HighAndSomeLowOr          -44.7    -55.4    -51.2    -58.1    -54.5    -47.9    -39.7    -44.9
HighAndTonsHighNot        475.7+   472.7+   507.0+   552.9+   627.9*   149.1    144.7    143.7
HighAndTonsHighOr         141.0+   135.4+   162.4+   153.4+   169.7*   154.0+   150.0+   149.1+
HighAndTonsLowNot         -49.9    -66.2    -46.8    -76.9    -30.3    -73.7    -28.6    -15.6
HighAndTonsLowOr          -22.4    -69.4    -30.2    -67.5    -41.9    -63.8    -24.4    -13.9
LowAndSomeHighNot           3.7     -2.6     -9.0     -7.3     -6.2      4.5+     6.2*     4.7+
LowAndSomeHighOr            1.5    -14.0    -15.5    -10.8    -12.0      6.8*     5.8+     6.6+
LowAndSomeLowNot          -26.4    -43.7    -56.5    -47.3    -43.7      3.7*    -2.3     -4.0
LowAndSomeLowOr           -23.2    -41.8    -60.5    -46.2    -43.4      2.2*    -2.3     -8.8
LowAndTonsHighNot         380.6+   171.5    118.4    248.3    381.8*    22.5     23.8     26.5
LowAndTonsHighOr           29.8*     5.2     -1.1     10.7      5.4     24.2+    27.5+    28.2+
LowAndTonsLowNot           28.9      9.1    -39.3      5.3      1.3     39.1+    47.2*    44.3+
LowAndTonsLowOr            30.9+     7.2    -38.1      0.5      9.0     29.9+    40.9*    38.1+

Task Good Method
  HighAndSomeHighNot   BS, 
   HighAndSomeHighOr   BS, size9, size8, size5, 
   HighAndSomeLowNot   
HighAndSomeLowOr   
  HighAndTonsHighNot   llbs, ll, BitSet, ArrayNotDel, BS, 
   HighAndTonsHighOr   llbs, BitSet, size5, ll, size8, size9, ArrayNotDel, BS, 
   HighAndTonsLowNot   
HighAndTonsLowOr   
   LowAndSomeHighNot   size8, size9, size5, 
LowAndSomeHighOr   size5, size9, size8, 
LowAndSomeLowNot   size5, 
 LowAndSomeLowOr   size5, 
   LowAndTonsHighNot   llbs, ArrayNotDel, 
LowAndTonsHighOr   ArrayNotDel, size9, size8, size5, 
LowAndTonsLowNot   size8, size9, size5, 
 LowAndTonsLowOr   size8, size9, ArrayNotDel, size5, 
{code}
BS performs best in the HighAndSomeHigh\* cases.

{quote}
For the rules on when to use which scorer, it seems like we should take the 
.cost() of the sub-clauses into account somehow...
{quote}
I have already taken .cost() into account; see the rules in the decider.
{code}
if (!required.isEmpty()  optional.size()  3) {
  float times = (float) required.get(0).cost() / optional.get(0).cost();
  if (times  1) return new BooleanNovelScorer(weight, disableCoord, 
minShouldMatch, required, optional, prohibited, maxCoord);
}   
if (!required.isEmpty()  prohibited.size()  3) {
  float times = (float) required.get(0).cost() / prohibited.get(0).cost();
  if (times  1) return new BooleanNovelScorer(weight, disableCoord, 
minShouldMatch, required, optional, prohibited, maxCoord);
}   
{code}
Here, I just take the first scorer's cost into account, as it may cost a lot to 
iterate all
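
As a self-contained illustration of a cost-ratio rule of this kind, consider the sketch below. The comparison operators in the snippet above were lost in the archive, so the directions used here (more clauses than a minimum, required cost lower than the other clause's cost) are assumptions, as are the class, enum, and parameter names. This is not the decider from the patch.

```java
/** Cost-ratio rule for picking a scorer (illustrative sketch with assumed comparisons). */
final class ScorerChoice {
  enum Kind { DEFAULT, NOVEL }

  /**
   * Picks NOVEL when there are more than minClauses optional (or prohibited) clauses
   * and the first required clause is cheaper than maxRatio times the first other clause.
   * ASSUMPTION: both comparison directions are guesses; the original operators are missing.
   */
  static Kind choose(long requiredCost, long otherCost, int otherClauseCount,
                     int minClauses, double maxRatio) {
    if (otherClauseCount > minClauses && otherCost > 0) {
      double ratio = (double) requiredCost / otherCost;
      if (ratio < maxRatio) {
        return Kind.NOVEL;
      }
    }
    return Kind.DEFAULT;
  }

  public static void main(String[] args) {
    // Thresholds 3 and 1 mirror the constants visible in the quoted rules.
    System.out.println(choose(100L, 1000L, 5, 3, 1.0));
  }
}
```

Only the first required clause and the first optional (or prohibited) clause feed the ratio, which keeps the decision cheap at the price of ignoring the remaining clauses.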

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-23 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a first attempt at merging the scorers, so that we can get better perf 
for boolean retrieval.

I created a new class named BooleanMixedScorerDecider to choose the best 
scorer.
The rules for choosing remain to be improved. I have been working on finding an 
elegant way to define them.
{code}
Task                  QPS baseline    StdDev  QPS my_version    StdDev  Pct diff
HighAndSomeLowNot            11.53    (7.3%)           10.75   (10.1%)     -6.8% ( -22% -   11%)
HighAndTonsLowNot             4.87    (4.0%)            4.64    (6.0%)     -4.9% ( -14% -    5%)
LowAndSomeLowOr             306.20    (2.2%)          299.06    (2.8%)     -2.3% (  -7% -    2%)
HighAndSomeLowOr             13.67    (9.4%)           13.38    (2.7%)     -2.1% ( -13% -   11%)
HighAndTonsLowOr              4.04    (6.4%)            3.96    (1.9%)     -1.9% (  -9% -    6%)
LowAndSomeLowNot            215.18    (1.9%)          211.14    (2.2%)     -1.9% (  -5% -    2%)
PKLookup                     96.26    (2.3%)           94.56    (2.8%)     -1.8% (  -6% -    3%)
HighAndTonsHighNot            0.06    (2.3%)            0.06    (2.6%)     -1.0% (  -5% -    4%)
HighAndTonsHighOr             0.06    (0.6%)            0.06    (1.3%)      0.9% (   0% -    2%)
HighAndSomeHighNot            1.59    (2.2%)            1.62    (2.9%)      1.7% (  -3% -    6%)
LowAndSomeHighNot            66.33    (2.1%)           68.77    (2.1%)      3.7% (   0% -    8%)
LowAndSomeHighOr             53.75    (1.6%)           56.86    (2.1%)      5.8% (   1% -    9%)
LowAndTonsLowNot             14.00    (1.7%)           14.84    (1.5%)      6.1% (   2% -    9%)
HighAndSomeHighOr             2.39    (2.2%)            2.68    (3.5%)     12.4% (   6% -   18%)
LowAndTonsLowOr              17.69    (0.9%)           21.64    (1.7%)     22.3% (  19% -   25%)
LowAndTonsHighOr              1.83    (1.3%)            2.33    (2.4%)     27.2% (  23% -   31%)
LowAndTonsHighNot             1.15    (1.5%)            1.51    (3.1%)     30.9% (  25% -   36%)
{code}

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: And.tasks, AndOr.tasks, AndOr.tasks, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, SIZE.perf, all.perf, luceneutil-score-equal.patch, 
 luceneutil-score-equal.patch, stat.cpp, stat.cpp


 Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
 If there is one or more MUST clauses we always use BooleanScorer2.
 But I suspect that unless the MUST clauses have very low hit count compared 
 to the other clauses, that BooleanScorer would perform better than 
 BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
 handle MUST so it shouldn't be hard to bring back this capability ... I think 
 the challenging part might be the heuristics on when to use which (likely we 
 would have to use firstDocID as proxy for total hit count).
 Likely we should also have BooleanScorer sometimes use .advance() on the subs 
 in this case, eg if suddenly the MUST clause skips 100 docs then you want 
 to .advance() all the SHOULD clauses.
 I won't have near term time to work on this so feel free to take it if you 
 are inspired!



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-21 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: all.perf
stat.cpp

I have retested the previously explored methods and computed statistics on their 
performance.

The file all.perf contains the raw perf data.
stat.cpp computes the statistics from all.perf.
{code}
g++ -std=c++0x stat.cpp -o stat
./stat < all.perf
{code}

The perf statistics are shown below.
{code}
Task                ArrayNotDel       BS   BitSet       ll     llbs    size5    size8    size9
HighAndSomeHighNot          0.7     15.3*     7.4      8.9      2.0      6.6     10.0      3.4
HighAndSomeHighOr          13.3     24.5*     7.8      9.1     10.9     17.3+    18.3+    21.3+
HighAndSomeLowNot         -45.1    -53.9    -55.0    -57.3    -45.5    -47.8    -42.2    -41.5
HighAndSomeLowOr          -44.7    -55.4    -51.2    -58.1    -54.5    -47.9    -39.7    -44.9
HighAndTonsHighNot        475.7+   472.7+   507.0+   552.9+   627.9*   149.1    144.7    143.7
HighAndTonsHighOr         141.0+   135.4+   162.4+   153.4+   169.7*   154.0+   150.0+   149.1+
HighAndTonsLowNot         -49.9    -66.2    -46.8    -76.9    -30.3    -73.7    -28.6    -15.6
HighAndTonsLowOr          -22.4    -69.4    -30.2    -67.5    -41.9    -63.8    -24.4    -13.9
LowAndSomeHighNot           3.7     -2.6     -9.0     -7.3     -6.2      4.5+     6.2*     4.7+
LowAndSomeHighOr            1.5    -14.0    -15.5    -10.8    -12.0      6.8*     5.8+     6.6+
LowAndSomeLowNot          -26.4    -43.7    -56.5    -47.3    -43.7      3.7*    -2.3     -4.0
LowAndSomeLowOr           -23.2    -41.8    -60.5    -46.2    -43.4      2.2*    -2.3     -8.8
LowAndTonsHighNot         380.6+   171.5    118.4    248.3    381.8*    22.5     23.8     26.5
LowAndTonsHighOr           29.8*     5.2     -1.1     10.7      5.4     24.2+    27.5+    28.2+
LowAndTonsLowNot           28.9      9.1    -39.3      5.3      1.3     39.1+    47.2*    44.3+
LowAndTonsLowOr            30.9+     7.2    -38.1      0.5      9.0     29.9+    40.9*    38.1+

Task                Good Method
HighAndSomeHighNot  BS
HighAndSomeHighOr   BS, size9, size8, size5
HighAndSomeLowNot
HighAndSomeLowOr
HighAndTonsHighNot  llbs, ll, BitSet, ArrayNotDel, BS
HighAndTonsHighOr   llbs, BitSet, size5, ll, size8, size9, ArrayNotDel, BS
HighAndTonsLowNot
HighAndTonsLowOr
LowAndSomeHighNot   size8, size9, size5
LowAndSomeHighOr    size5, size9, size8
LowAndSomeLowNot    size5
LowAndSomeLowOr     size5
LowAndTonsHighNot   llbs, ArrayNotDel
LowAndTonsHighOr    ArrayNotDel, size9, size8, size5
LowAndTonsLowNot    size8, size9, size5
LowAndTonsLowOr     size8, size9, ArrayNotDel, size5
{code}
Among them, 'll' is the linked-list docs method, while 'llbs' is the linked list 
with bitset.
The character '*' marks the best perf, while '+' marks results that are about as good as 
the best.
I have been merging these methods. I decided to move the scorer-choosing logic 
into a new class, but ran into a bug.
I'm working on it.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-15 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This patch is based on git mirror commit 
ce7d0578b30981d15687bf76aec595274efccbad

In this patch, I just compact the array as I go through the MUST_NOT docs.
{code}
Task                QPS baseline   StdDev QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           4.88   (3.5%)           2.44   (4.4%)    -49.9%  (-55% - -43%)
HighAndSomeLowNot           6.55   (6.1%)           3.60   (4.7%)    -45.1%  (-52% - -36%)
HighAndSomeLowOr            9.93  (12.9%)           5.49   (4.7%)    -44.7%  (-55% - -31%)
LowAndSomeLowNot          293.78   (2.3%)         216.29   (1.7%)    -26.4%  (-29% - -22%)
LowAndSomeLowOr           347.60   (1.8%)         266.94   (1.2%)    -23.2%  (-25% - -20%)
HighAndTonsLowOr            5.59   (5.7%)           4.34   (4.4%)    -22.4%  (-30% - -13%)
PKLookup                   97.38   (2.1%)          95.54   (2.9%)     -1.9%  ( -6% -   3%)
HighAndSomeHighNot          1.88   (2.2%)           1.89   (6.6%)      0.7%  ( -7% -   9%)
LowAndSomeHighOr           41.32   (2.9%)          41.92   (2.8%)      1.5%  ( -4% -   7%)
LowAndSomeHighNot          54.74   (2.4%)          56.73   (2.7%)      3.7%  ( -1% -   8%)
HighAndSomeHighOr           2.26   (2.7%)           2.56   (6.8%)     13.3%  (  3% -  23%)
LowAndTonsLowNot           17.18   (1.2%)          22.14   (2.4%)     28.9%  ( 24% -  32%)
LowAndTonsHighOr            1.21   (2.7%)           1.57   (4.4%)     29.8%  ( 22% -  37%)
LowAndTonsLowOr            17.22   (1.3%)          22.53   (2.4%)     30.9%  ( 26% -  35%)
HighAndTonsHighOr           0.07   (1.2%)           0.16  (13.1%)    141.0%  (125% - 157%)
LowAndTonsHighNot           2.02   (2.4%)           9.70   (9.7%)    380.6%  (360% - 402%)
HighAndTonsHighNot          0.09   (1.2%)           0.50  (23.1%)    475.7%  (446% - 505%)
{code}
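
The compaction mentioned above (dropping MUST_NOT docs from the array while scanning it, rather than keeping a separate valid flag per entry) can be sketched as a two-pointer pass over sorted doc IDs. This is a minimal illustration, not the patch itself: `CompactAsYouGo` and `compact` are hypothetical names, and the sketch tracks doc IDs only, without scores.

```java
import java.util.Arrays;

/** Illustrative only: compacts a sorted doc-ID array in place, dropping prohibited docs. */
final class CompactAsYouGo {
  /**
   * Removes every doc in docs[0..size) that also appears in the sorted
   * prohibited array, keeping the survivors packed at the front.
   *
   * @return the number of surviving docs
   */
  static int compact(int[] docs, int size, int[] prohibited) {
    int write = 0; // next free slot for a surviving doc
    int p = 0;     // cursor into the prohibited docs
    for (int read = 0; read < size; read++) {
      int doc = docs[read];
      // Advance the prohibited cursor until it reaches or passes the current doc.
      while (p < prohibited.length && prohibited[p] < doc) {
        p++;
      }
      boolean excluded = p < prohibited.length && prohibited[p] == doc;
      if (!excluded) {
        docs[write++] = doc; // survivors overwrite slots that were already read
      }
    }
    return write;
  }

  public static void main(String[] args) {
    int[] docs = {1, 4, 7, 9, 12};
    int kept = compact(docs, docs.length, new int[] {4, 9, 10});
    System.out.println(Arrays.toString(Arrays.copyOf(docs, kept)));
  }
}
```

Because the write cursor never passes the read cursor, no second array is needed, and later passes only touch the entries that survived.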

Besides, I am now working on combining all the explored methods to get better perf. 
To get more accurate perf numbers for each method, I am retesting some of the previous 
methods. 
It may take several days to get a combined method working.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-15 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063089#comment-14063089
]

Da Huang commented on LUCENE-4396:
--

Thank you, Mike!
{quote}
It looks like this gave some nice gains with the many-not cases
{quote}
Yes, but many-not cases may not be common. Therefore, this method might
not be used in the final method.

{quote}
Curiously some of the tasks are really hurt by the larger sizes ... maybe 1<<9
is a good compromise?
{quote}
Yeah. In the end, I will focus only on the \*Some\* cases.
size9 is better for the HighAndSomeHighOr case, while size5 is better for the
LowAndSomeHighOr, LowAndSomeLowNot and LowAndSomeLowOr cases.
I think it would be better to detect the case type and adjust the SIZE of
bucketTable in BNS's constructor.


[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-15 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063089#comment-14063089
]

Da Huang edited comment on LUCENE-4396 at 7/16/14 3:53 AM:
---

Thank you, Mike!
{quote}
It looks like this gave some nice gains with the many-not cases
{quote}
Yes, but many-not cases might not be a usual case. Therefore, this method might
not be used in the final method.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-14 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: SIZE.perf
stat.cpp

I have tested different values of SIZE for bucketTable.
The file 'SIZE.perf' contains the raw test results.

'stat.cpp' is a C++ program that computes statistics from *.perf files.
You can compile it with 'g++ stat.cpp -std=c++0x -o stat'
and run it with './stat < SIZE.perf'.

The statistics for SIZE.perf should look as follows.
{code}
Task                  size10   size11    size5    size6    size7    size8    size9
HighAndSomeHighNot     -14.5      4.0      6.6     -3.0      5.2     10.0*     3.4
HighAndSomeHighOr        2.4     10.9     17.3     17.4     12.9     18.3     21.3*
HighAndSomeLowNot      -36.8    -37.3    -47.8    -47.8    -40.2    -42.2    -41.5
HighAndSomeLowOr       -45.1    -46.4    -47.9    -46.2    -38.7    -39.7    -44.9
HighAndTonsHighNot     162.4*   145.1    149.1    130.1    142.9    144.7    143.7
HighAndTonsHighOr      154.8*   146.5    154.0    137.8    144.9    150.0    149.1
HighAndTonsLowNot      -27.0    -17.4    -73.7    -49.6    -40.1    -28.6    -15.6
HighAndTonsLowOr       -28.7    -14.3    -63.8    -44.8    -33.0    -24.4    -13.9
LowAndSomeHighNot        3.0      0.2      4.5      6.2*     5.7      6.2*     4.7
LowAndSomeHighOr         5.3      1.4      6.8*     6.7      7.7      5.8      6.6
LowAndSomeLowNot        -6.3    -24.4      3.7*     0.8      1.7     -2.3     -4.0
LowAndSomeLowOr        -10.3    -22.7      2.2*     2.0      1.7     -2.3     -8.8
LowAndTonsHighNot       27.3*    21.4     22.5     21.5     21.0     23.8     26.5
LowAndTonsHighOr        23.1     28.2     24.2     23.9     29.1*    27.5     28.2
LowAndTonsLowNot        33.0     46.5     39.1     33.4     30.0     47.2*    44.3
LowAndTonsLowOr         45.7*    34.6     29.9     36.8     45.3     40.9     38.1
{code}

size7 means the bucketTable's size is 1 << 7.

It seems that we can get a better result on the \*Some\* tasks if we combine size9 
with size5.
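
To make the "sizeN" labels concrete, here is a small sketch of how a power-of-two bucket table maps doc IDs to slots. It is an illustration under assumptions, not the patch's code: `BucketTableSize`, `slot` and `window` are hypothetical names; only the relation size = 1 << N and the masking `doc & MASK` are taken from this thread.

```java
/** Illustrative only: how a power-of-two bucket table maps doc IDs to slots. */
final class BucketTableSize {
  final int size; // number of buckets, always a power of two
  final int mask; // size - 1, so (doc & mask) is the slot inside the current window

  BucketTableSize(int sizeBits) {
    this.size = 1 << sizeBits;
    this.mask = size - 1;
  }

  /** Slot of a doc inside its window. */
  int slot(int doc) {
    return doc & mask;
  }

  /** Index of the window (a run of size consecutive doc IDs) that holds the doc. */
  int window(int doc) {
    return doc / size;
  }

  public static void main(String[] args) {
    BucketTableSize size7 = new BucketTableSize(7);
    System.out.println(size7.size + " buckets; doc 130 -> window "
        + size7.window(130) + ", slot " + size7.slot(130));
  }
}
```

A smaller table means more windows per segment but less memory touched per window, which is one plausible reason the best size differs between the task types in the table above.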



[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-14 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060615#comment-14060615
 ] 

Da Huang edited comment on LUCENE-4396 at 7/14/14 1:06 PM:
---

I have tested different values of SIZE for bucketTable.
The file 'SIZE.perf' contains the raw test results.

'stat.cpp' is a C++ program that computes statistics from *.perf files.
You can compile it with 'g++ stat.cpp -std=c++0x -o stat'
and run it with './stat < SIZE.perf'.

The statistics for SIZE.perf should look as follows.
{code}
Task                  size10   size11    size5    size6    size7    size8    size9
HighAndSomeHighNot     -14.5      4.0      6.6     -3.0      5.2     10.0*     3.4
HighAndSomeHighOr        2.4     10.9     17.3     17.4     12.9     18.3     21.3*
HighAndSomeLowNot      -36.8    -37.3    -47.8    -47.8    -40.2    -42.2    -41.5
HighAndSomeLowOr       -45.1    -46.4    -47.9    -46.2    -38.7    -39.7    -44.9
HighAndTonsHighNot     162.4*   145.1    149.1    130.1    142.9    144.7    143.7
HighAndTonsHighOr      154.8*   146.5    154.0    137.8    144.9    150.0    149.1
HighAndTonsLowNot      -27.0    -17.4    -73.7    -49.6    -40.1    -28.6    -15.6
HighAndTonsLowOr       -28.7    -14.3    -63.8    -44.8    -33.0    -24.4    -13.9
LowAndSomeHighNot        3.0      0.2      4.5      6.2*     5.7      6.2*     4.7
LowAndSomeHighOr         5.3      1.4      6.8*     6.7      7.7      5.8      6.6
LowAndSomeLowNot        -6.3    -24.4      3.7*     0.8      1.7     -2.3     -4.0
LowAndSomeLowOr        -10.3    -22.7      2.2*     2.0      1.7     -2.3     -8.8
LowAndTonsHighNot       27.3*    21.4     22.5     21.5     21.0     23.8     26.5
LowAndTonsHighOr        23.1     28.2     24.2     23.9     29.1*    27.5     28.2
LowAndTonsLowNot        33.0     46.5     39.1     33.4     30.0     47.2*    44.3
LowAndTonsLowOr         45.7*    34.6     29.9     36.8     45.3     40.9     38.1
{code}

size7 means the bucketTable's size is 1 << 7.
The character '*', which was added manually, marks the best value.

It seems that we can get a better result on the \*Some\* tasks if we combine size9 
with size5.




[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-08 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055653#comment-14055653
]

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions, Mike.
{quote}
Maybe try testing different values of SIZE?
{quote}
Hmm, that's a good idea.

{quote}
When you fold in the MUST_NOT clauses you could just compact the array
as you go, instead of having a separate valid bool?
{quote}
Oh, that's a great idea! I will do that in the next patch.

{quote}
I think we should start moving this towards something committable?
I.e., of all the approaches you've explored, let's take the most
promising and fold them into a new scorer, and then work on the
logic/heuristics for when this scorer should and shouldn't apply?
{quote}
Yeah, I agree. I am working on that.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-07-04 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on the git mirror commit 
7f66461aea7bc2cb6f31a993cba77734e5e0f9d9.

In this patch, I treat the bucketTable as an array rather than a hash table.
Its perf seems better than that of the former patches in most cases.

As you know, after putting the required docs into the bucketTable, I have to scan both 
the table and the optional docs. Here, I have tried skipping while scanning the 
bucketTable to improve the perf. The results are as follows.


{code}
No skip
Task                QPS baseline   StdDev QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           6.56   (3.1%)           2.59   (1.0%)    -60.5%  (-62% - -58%)
HighAndTonsLowOr            6.43   (3.3%)           2.58   (0.8%)    -59.9%  (-61% - -57%)
HighAndSomeLowOr            8.49   (8.5%)           4.05   (1.8%)    -52.3%  (-57% - -45%)
HighAndSomeLowNot           6.17   (8.6%)           3.16   (2.1%)    -48.8%  (-54% - -41%)
LowAndSomeLowOr           250.58   (2.0%)         194.86   (1.6%)    -22.2%  (-25% - -18%)
LowAndSomeLowNot          178.66   (1.6%)         147.67   (2.2%)    -17.3%  (-20% - -13%)
LowAndSomeHighOr           40.71   (2.8%)          41.50   (1.8%)      2.0%  ( -2% -   6%)
PKLookup                   97.59   (3.0%)          99.52   (4.6%)      2.0%  ( -5% -   9%)
LowAndSomeHighNot          20.76   (3.0%)          21.54   (2.3%)      3.7%  ( -1% -   9%)
HighAndSomeHighNot          2.22   (1.7%)           2.67   (4.4%)     20.3%  ( 13% -  26%)
LowAndTonsHighNot           3.81   (2.3%)           4.60   (2.1%)     20.8%  ( 15% -  25%)
LowAndTonsHighOr            2.87   (2.3%)           3.48   (2.6%)     21.2%  ( 15% -  26%)
HighAndSomeHighOr           1.74   (2.1%)           2.16   (3.5%)     24.0%  ( 18% -  30%)
LowAndTonsLowOr            18.66   (1.3%)          23.68   (1.9%)     26.9%  ( 23% -  30%)
LowAndTonsLowNot           16.01   (1.4%)          22.16   (2.8%)     38.4%  ( 33% -  43%)
HighAndTonsHighOr           0.04   (0.9%)           0.11   (9.8%)    158.2%  (146% - 170%)
HighAndTonsHighNot          0.06   (1.1%)           0.15  (13.5%)    166.2%  (149% - 182%)

---
Binary search skip
Task                QPS baseline   StdDev QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           6.22   (3.8%)           2.45   (0.9%)    -60.6%  (-62% - -58%)
HighAndSomeLowOr            8.29  (11.2%)           4.40   (3.0%)    -46.9%  (-54% - -36%)
HighAndSomeLowNot          12.34   (7.1%)           6.65   (2.6%)    -46.1%  (-52% - -39%)
LowAndSomeLowOr           232.38   (2.9%)         165.05   (1.8%)    -29.0%  (-32% - -24%)
HighAndTonsLowOr            5.17   (6.2%)           3.75   (3.0%)    -27.4%  (-34% - -19%)
LowAndSomeLowNot          227.71   (2.6%)         171.13   (3.2%)    -24.8%  (-29% - -19%)
HighAndSomeHighOr           1.35   (3.9%)           1.14   (3.5%)    -16.1%  (-22% -  -9%)
LowAndSomeHighOr           50.17   (3.6%)          48.84   (3.7%)     -2.7%  ( -9% -   4%)
LowAndSomeHighNot          52.71   (3.0%)          51.55   (3.8%)     -2.2%  ( -8% -   4%)
PKLookup                   90.17   (3.5%)          91.38   (3.3%)      1.3%  ( -5% -   8%)
HighAndSomeHighNot          1.69   (2.9%)           2.00   (6.3%)     18.5%  (  8% -  28%)
LowAndTonsLowOr            15.61   (1.9%)          18.59   (2.8%)     19.0%  ( 14% -  24%)
LowAndTonsHighOr            1.82   (2.7%)           2.20   (4.6%)     20.7%  ( 13% -  28%)
LowAndTonsLowNot           15.51   (1.7%)          20.14   (3.8%)     29.8%  ( 23% -  35%)
LowAndTonsHighNot           1.01   (2.9%)           1.34   (6.5%)     31.7%  ( 21% -  42%)
HighAndTonsHighOr           0.07   (0.9%)           0.12   (6.9%)     77.7%  ( 69% -  86%)
HighAndTonsHighNot          0.07   (1.4%)           0.19  (11.9%)    162.4%  (146% - 178%)

---
8 steps skip
Task                QPS baseline   StdDev QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           5.45   (3.3%)           1.69   (1.3%)    -69.0%  (-71% - -66%)
HighAndSomeLowOr            5.46  (11.0%)           2.76   (4.4%)    -49.5%  (-58% - -38%)
HighAndSomeLowNot          17.94   (5.7%)          10.40   (3.8%)    -42.1%  (-48% - -34%)
LowAndSomeLowOr           306.62   (1.7%)         231.45   (1.5%)    -24.5%  (-27% - -21%)
LowAndSomeLowNot          286.30   (1.7%)         218.13   (2.0%)    -23.8%  (-27% - -20%)
HighAndTonsLowOr            6.34

[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-21 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039754#comment-14039754
]

Da Huang commented on LUCENE-4396:
--

{quote}
Looks like you separated required & optional
scores in the non-DAAT impls and then carefully cast to float at the
right times?
{quote}
Yes, you get what I mean.
{quote}
you can remove that TODO in ConjunctionScorer on
switching sum to double?
{quote}
OK, I will do that in the next patch.
{quote}
So BooleanScorerIO is just like BooleanNovelScorer, except it uses a
bitset instead of linked list to track the set buckets? Between BNS
and BSIO which one is faster?
{quote}
Yes, exactly. According to the perf tests, it seems that
BNS does better on the tasks that are faster than trunk,
while BSIO does better on the tasks that are slower than trunk.
{quote}
Why does BSIO/NS see massive gains on the tasks that have so many NOT
clauses? I think in trunk/4.x today, we are not scoring the NOT
clauses, right? While these gains are sizable, I think it's not a
common use case...
{quote}
The reason is that when we search for +a -b -c -d,
Lucene actually executes +a -(b c d), and the cost of computing the disjunction (b c d)
is huge.
Indeed, such a case may not be common.
{quote}
I think you've explored a number of options here and now we need to
see if we can make this committable, e.g. figure out how to have
BooleanQuery pick the right scorer for the situation? Somehow we need
logic that looks at how many / cost of the sub-clauses and picks the
right scorer?
{quote}
Yeah, you're right.

Besides, a new idea has occurred to me. For BNS, we do not actually
make use of the hash feature of BucketTable. Thus, I think we should not
treat BucketTable as a hash table (i.e. do not place a doc at the absolute position
buckets[doc & MASK]).
First, we load 2K required docs into BucketTable. Then, we do TAAT on these 2K
docs.
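
The idea above (fill a dense array with a chunk of required docs, then process the optional clauses term-at-a-time against that chunk) might look roughly like the following. This is a sketch under assumptions, not the patch: postings are modelled as sorted int arrays with parallel score arrays, and `DenseTaat`, `fill` and `addOptional` are hypothetical names.

```java
/** Illustrative only: a dense (non-hashed) doc table scored term-at-a-time. */
final class DenseTaat {
  static final int CHUNK = 2048; // "2K" required docs per round, as in the message

  /** Copies up to CHUNK required docs, starting at from, into slots 0..n-1. */
  static int fill(int[] requiredDocs, int from, int[] docs) {
    int n = Math.min(CHUNK, requiredDocs.length - from);
    System.arraycopy(requiredDocs, from, docs, 0, n);
    return n;
  }

  /**
   * Adds one optional clause's scores to the required docs it matches.
   * All doc-ID arrays are sorted ascending; docs[i] pairs with scores[i].
   * Called once per optional clause, hence term-at-a-time.
   */
  static void addOptional(int[] docs, float[] scores, int size,
                          int[] optDocs, float[] optScores) {
    int o = 0;
    for (int i = 0; i < size && o < optDocs.length; i++) {
      while (o < optDocs.length && optDocs[o] < docs[i]) {
        o++; // skip optional docs that match no required doc
      }
      if (o < optDocs.length && optDocs[o] == docs[i]) {
        scores[i] += optScores[o];
      }
    }
  }

  public static void main(String[] args) {
    int[] docs = new int[CHUNK];
    float[] scores = new float[CHUNK];
    int n = fill(new int[] {2, 5, 8}, 0, docs);
    addOptional(docs, scores, n, new int[] {1, 5, 8, 9}, new float[] {1f, 2f, 3f, 4f});
    System.out.println(scores[1] + " " + scores[2]);
  }
}
```

Since slot i simply holds the i-th required doc of the chunk, no masking is involved and the table stays fully packed even when the required docs are sparse.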


[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-21 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039754#comment-14039754
]

Da Huang edited comment on LUCENE-4396 at 6/21/14 10:00 AM:

{quote}
Looks like you separated required & optional
scores in the non-DAAT impls and then carefully cast to float at the
right times?
{quote}
Yes, you get what I mean.
{quote}
you can remove that TODO in ConjunctionScorer on
switching sum to double?
{quote}
OK, I will do that in the next patch.
{quote}
So BooleanScorerIO is just like BooleanNovelScorer, except it uses a
bitset instead of linked list to track the set buckets? Between BNS
and BSIO which one is faster?
{quote}
Yes, exactly. According to the perf tests, it seems that
BNS does better on the tasks that are faster than trunk,
while BSIO does better on the tasks that are slower than trunk.
{quote}
Why does BSIO/NS see massive gains on the tasks that have so many NOT
clauses? I think in trunk/4.x today, we are not scoring the NOT
clauses, right? While these gains are sizable, I think it's not a
common use case...
{quote}
The reason is that when we search for +a -b -c -d,
Lucene actually executes +a -(b c d), and the cost of computing the disjunction (b c d)
is huge.
Indeed, such a case may not be common.
{quote}
I think you've explored a number of options here and now we need to
see if we can make this committable, e.g. figure out how to have
BooleanQuery pick the right scorer for the situation? Somehow we need
logic that looks at how many / cost of the sub-clauses and picks the
right scorer?
{quote}
Yeah, you're right.


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-20 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 
8f9b823db1d6fba2cc7ec61b0596970f3c8bbe85.
The following things are done in this patch:

1. Completely resolve the score difference between pure DAAT and BS. (By pure 
DAAT I mean what used to be BS2; since BS2 no longer exists, I think it is 
better to call it pure DAAT.)

2. Add a new scorer named BooleanScorerInOrder (BSIO), which collects docs 
using only a bitset, with no linked list.
I created this new scorer rather than changing the old BS, because I think BS 
may be more useful in some cases.
For now, BSIO does not support queries without any MUST clause, because the 
procedure for such queries is totally different from the one for queries with 
MUST clauses.
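To illustrate the bookkeeping described in item 2, here is a minimal sketch (not code from the patch): matching doc IDs within a window are recorded in a java.util.BitSet and read back with nextSetBit, which yields them in increasing order no matter what order they were set in. The class name, method names and window size are made up for illustration.

```java
import java.util.BitSet;

/** Hypothetical illustration; not taken from the LUCENE-4396 patch. */
final class InOrderWindow {
    private final BitSet matched;

    InOrderWindow(int windowSize) {
        this.matched = new BitSet(windowSize);
    }

    /** Record a matching doc; the order of calls does not matter. */
    void collect(int docInWindow) {
        matched.set(docInWindow);
    }

    /** Return the collected docs in increasing doc ID order. */
    int[] docsInOrder() {
        int[] docs = new int[matched.cardinality()];
        int i = 0;
        for (int doc = matched.nextSetBit(0); doc >= 0; doc = matched.nextSetBit(doc + 1)) {
            docs[i++] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        InOrderWindow window = new InOrderWindow(2048);
        window.collect(900);
        window.collect(3);
        window.collect(17);
        for (int doc : window.docsInOrder()) {
            System.out.println(doc); // prints 3, 17, 900 on separate lines
        }
    }
}
```

A linked list of set buckets only visits entries that were actually set, so it can still be cheaper than scanning a bitset when very few docs in the window match.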

The perf. of BSIO is as follows.
{code}
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
LowAndSomeLowOr           259.82   (2.3%)          102.70   (2.8%)    -60.5% ( -64% -  -56%)
LowAndSomeLowNot          184.38   (2.8%)           80.26   (2.3%)    -56.5% ( -59% -  -52%)
HighAndSomeLowNot          10.44   (7.2%)            4.70   (4.3%)    -55.0% ( -61% -  -46%)
HighAndSomeLowOr           18.11   (8.0%)            8.83   (4.0%)    -51.2% ( -58% -  -42%)
HighAndTonsLowNot           3.03   (5.4%)            1.62   (4.7%)    -46.8% ( -53% -  -38%)
LowAndTonsLowNot           14.59   (1.2%)            8.86   (2.0%)    -39.3% ( -41% -  -36%)
LowAndTonsLowOr            14.11   (1.1%)            8.74   (3.0%)    -38.1% ( -41% -  -34%)
HighAndTonsLowOr            5.52   (4.3%)            3.85   (5.2%)    -30.2% ( -38% -  -21%)
LowAndSomeHighOr           24.97   (3.5%)           21.10   (3.2%)    -15.5% ( -21% -   -9%)
LowAndSomeHighNot          25.51   (3.3%)           23.22   (3.4%)     -9.0% ( -15% -   -2%)
LowAndTonsHighOr            1.66   (2.6%)            1.64   (2.8%)     -1.1% (  -6% -    4%)
PKLookup                   95.22   (5.5%)           96.64   (6.1%)      1.5% (  -9% -   13%)
HighAndSomeHighNot          2.37   (2.0%)            2.55   (6.9%)      7.4% (  -1% -   16%)
HighAndSomeHighOr           2.25   (2.7%)            2.43   (6.0%)      7.8% (   0% -   16%)
LowAndTonsHighNot           2.72   (2.3%)            5.94   (5.8%)    118.4% ( 107% -  129%)
HighAndTonsHighOr           0.05   (0.8%)            0.12  (17.0%)    162.4% ( 143% -  181%)
HighAndTonsHighNot          0.08   (1.3%)            0.48  (23.4%)    507.0% ( 476% -  538%)
{code}

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: And.tasks, AndOr.tasks, AndOr.tasks, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, luceneutil-score-equal.patch, 
 luceneutil-score-equal.patch


 Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
 If there is one or more MUST clauses we always use BooleanScorer2.
 But I suspect that unless the MUST clauses have very low hit count compared 
 to the other clauses, that BooleanScorer would perform better than 
 BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
 handle MUST so it shouldn't be hard to bring back this capability ... I think 
 the challenging part might be the heuristics on when to use which (likely we 
 would have to use firstDocID as proxy for total hit count).
 Likely we should also have BooleanScorer sometimes use .advance() on the subs 
 in this case, eg if suddenly the MUST clause skips 100 docs then you want 
 to .advance() all the SHOULD clauses.
 I won't have near term time to work on this so feel free to take it if you 
 are inspired!



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-09 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025945#comment-14025945
 ] 

Da Huang commented on LUCENE-4396:
--

Hmm. What I mean is that ConjunctionScorer does not use a PQ, so using it can 
be faster than enumerating all the matching docs for each MUST clause.
As for .advance, I'm not sure whether its cost exceeds that of .next by enough 
to make using .advance slower than using .next in this case.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-08 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021505#comment-14021505
 ] 

Da Huang commented on LUCENE-4396:
--

{quote}True, but maybe in such cases (low freqs for the clauses) we should just 
use BS2. I think BS/BNS do better for high-freq clauses?{quote}
I'm sorry, but I can't be sure yet whether that's true, as I haven't analyzed 
the perf. results closely. 
The perf. of BS/BNS depends on many factors, such as the freq of each clause 
and the number of SHOULD (and MUST_NOT) clauses.

{quote}I think we may get better performance when the MUST clauses are high 
freq, if we just use BooleanScorer to enumerate all the matching docs for each 
MUST instead of going through ConjunctionScorer?{quote}
I'm afraid that enumerating all the matching docs would not give better perf. 
In fact, BS2 and ConjunctionScorer collect docs document-at-a-time (DAAT), 
while BS/BNS is something like a combination of DAAT and 
term-at-a-time (TAAT). 
For conjunctive clauses, DAAT is more efficient than TAAT, as DAAT 
scans fewer docs than TAAT.
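To make the DAAT argument concrete, here is a minimal sketch of a document-at-a-time intersection over two sorted postings lists. It is an illustration under simplifying assumptions (postings as plain sorted int arrays, a binary-search advance), not Lucene's ConjunctionScorer; all names are hypothetical. Each list jumps directly to the other list's current doc instead of walking every posting.

```java
import java.util.Arrays;

/** Hypothetical sketch of document-at-a-time (DAAT) intersection; not Lucene code. */
final class DaatSketch {

    /** Smallest index i >= from with postings[i] >= target, or postings.length if none. */
    static int advance(int[] postings, int from, int target) {
        int lo = from;
        int hi = postings.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (postings[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Intersect two sorted postings lists by leapfrogging between them. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int n = 0;
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i + 1, b[j]); // skip docs of a that cannot match
            } else {
                j = advance(b, j + 1, a[i]); // skip docs of b that cannot match
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] a = {1, 5, 9, 20, 40};
        int[] b = {5, 6, 7, 8, 20, 21};
        System.out.println(Arrays.toString(intersect(a, b))); // prints [5, 20]
    }
}
```

A term-at-a-time approach would instead visit every posting of each clause and merge the results afterwards, which is where the extra scanning comes from.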


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-04 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017708#comment-14017708
 ] 

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions, Mike!

{quote}
When you say BNS (without bitset) vs. BS2 that means baseline=BS2
and my_version=BNS (without bitset)?
{quote}
Yes, this is just what I mean.

{quote}
With the added bitset, couldn't you not use a linked list anymore?
Ie, just use prev/nextSetBit. I wonder if the bitset (instead of the
linked list) could also help BooleanScorer? Maybe test this change
separately (e.g. just modify BS we have today on trunk) to see if it
helps or hurts... if it does help, it seems like BNS could be
used (or BS could be a Scorer not a BulkScorer) even when there are no
MUST clauses? Ie, the bitset lets us easily keep the order. Then we
can merge BS/BNS into one?
{quote}
Oh, that's a good idea! I will try that. However, a linked list can be helpful 
when the required docs are extremely sparse.

{quote}
Could you attach all new tasks as a single file in general? Note that
when you set up a luceneutil test, you can add a task filter using
addTaskPattern, so you run just a subset of the tasks for that one
test.
{quote}
Do you mean merging And.tasks and AndOr.tasks? If so, there's no need to
do that, because And.tasks contains all the tasks in AndOr.tasks, although the
task names have changed.
By the way, thanks for the advice on using addTaskPattern. I hadn't noticed 
that.

{quote}
Strange that the scores are still different between BS/BS2 and BNS/BS2
when using double.
{quote}
I don't think it is strange, because the difference comes from the order in 
which the scores are summed.
Suppose that a doc matches +a b c. Then 
SCORE_BS = (float)((float)(double)score_a + (float)score_b) + (float)score_c, 
while 
SCORE_BS2 = (float)(double)score_a + ((float)score_b + (float)score_c). 
Here, (float) means that the score is only available through .score(), whose 
return type is float.
The modification in this patch only gives score_a a temporary double value.

{quote}
If there's only 1 required clause sent to BS/BNS can't we use its scorer
instead?
Have you explored having BS interact directly with all the MUST
clauses, rather than using ConjunctionScorer?
{quote}
Hmm. I don't think that would be helpful. The reason is just the same as above.

{quote}
Because we have wildly divergent results (sometimes one is much
faster, other times it's much slower) we will somehow need to add
logic to pick the right scorer for each query. But we can defer this
until we're doneish iterating the changes to each scorer... it can
come later on.
{quote}
Yes, I agree.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-04 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018417#comment-14018417
 ] 

Da Huang commented on LUCENE-4396:
--

About the score diff. between BS and BS2 (the same applies to BNS/BS2)

Currently the scores differ between BS and BS2 when executing a query like 
+a b c d.

I have been told that the reason is indicated by 
the TODO in ReqOptSumScorer.score(), which says:
{code}
// TODO: sum into a double and cast to float if we ever send required clauses to BS1
{code}

However, I don't think so, as the score difference is due to
different score calculation orders.

Suppose that a doc matches the query +a b c d. The score calculated by BS is 
{code}
BS.score(doc) = ((a.score() + b.score()) + c.score()) + d.score()
{code}
while the score calculated by BS2 is 
{code}
BS2.score(doc) = a.score() + (float)(b.score() + c.score() + d.score())
{code}

Notice that, in BS2, we can only get the float value of 
(b.score() + c.score() + d.score()) from optScorer.score().

Furthermore, I have noticed that we can actually control BS's 
score calculation order, so that 
{code}
BS.score(doc) = a.score() + ((b.score() + c.score()) + d.score())
{code}
However, for BS2, we do not know the calculation order of 
(b.score() + c.score() + d.score()), as the order is determined by each 
scorer's position in a heap. I still think this matters little.

I will rearrange the calculation order of BS.score() in the next patch 
to see whether it works.
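The effect of the summation order can be reproduced in isolation. The sketch below is only an illustration (hypothetical names, not Lucene code), and the values are deliberately exaggerated so that the rounding is easy to follow by hand; real scores differ by far less. It evaluates the two groupings from this comment on the same four float values.

```java
/** Hypothetical illustration of order-dependent float sums; not Lucene code. */
final class ScoreOrder {

    /** BS-style order: ((a + b) + c) + d. */
    static float leftToRight(float a, float b, float c, float d) {
        return ((a + b) + c) + d;
    }

    /** BS2-style order: a + (float) (b + c + d). */
    static float requiredPlusOptional(float a, float b, float c, float d) {
        float optional = (b + c) + d;
        return a + optional;
    }

    public static void main(String[] args) {
        float a = 16777216f; // 2^24: adding 1.0f to it is lost to rounding
        float b = 1f;
        float c = 1f;
        float d = 2f;
        System.out.println(leftToRight(a, b, c, d));          // prints 1.6777218E7
        System.out.println(requiredPlusOptional(a, b, c, d)); // prints 1.677722E7
    }
}
```

In the first grouping each small addend is rounded away individually, while in the second the small values are combined first and survive the final addition.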



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-03 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch
And.tasks

A patch based on lucene github mirror commit 
cf10341825ff6bd1662dd48c51926bc51d751ce5.

I use a bitset to skip required docs when scanning optional and prohibited docs. 
The perf. comparison is at the bottom.

Besides, I built a new tasks file to test the perf., and I discovered that BNS 
optimizes the +a -b -c -d ... case a lot when b c d ... hit many docs.

{code}
BNS (without bitset) vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           4.29   (2.9%)            1.08   (0.6%)    -74.8% ( -76% -  -73%)
HighAndTonsLowOr            4.87   (6.4%)            1.24   (1.0%)    -74.4% ( -76% -  -71%)
HighAndSomeLowNot           9.03   (5.2%)            4.11   (4.1%)    -54.4% ( -60% -  -47%)
HighAndSomeLowOr           16.21   (9.6%)            7.75   (4.1%)    -52.2% ( -60% -  -42%)
LowAndSomeLowOr           303.28   (2.4%)          183.14   (6.6%)    -39.6% ( -47% -  -31%)
LowAndSomeLowNot          257.24   (1.8%)          157.07   (6.5%)    -38.9% ( -46% -  -31%)
LowAndSomeHighOr           36.78   (1.9%)           33.74   (3.0%)     -8.3% ( -12% -   -3%)
LowAndTonsLowNot           21.28   (2.0%)           19.69   (6.9%)     -7.5% ( -16% -    1%)
LowAndSomeHighNot          34.40   (1.6%)           33.69   (3.2%)     -2.1% (  -6% -    2%)
PKLookup                  100.63   (4.8%)          103.46   (4.7%)      2.8% (  -6% -   12%)
LowAndTonsHighOr            1.26   (1.6%)            1.41   (1.7%)     11.8% (   8% -   15%)
LowAndTonsLowOr            13.66   (0.9%)           15.50   (6.0%)     13.5% (   6% -   20%)
HighAndSomeHighNot          2.65   (1.4%)            3.12   (6.5%)     17.6% (   9% -   25%)
HighAndSomeHighOr           2.21   (2.4%)            2.62   (5.8%)     18.6% (  10% -   27%)
HighAndTonsHighOr           0.07   (0.8%)            0.19  (10.5%)    160.3% ( 147% -  172%)
LowAndTonsHighNot           2.86   (1.6%)           10.24  (18.1%)    257.7% ( 234% -  281%)
HighAndTonsHighNot          0.05   (0.8%)            0.40  (28.2%)    641.8% ( 607% -  676%)
  

BS vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndTonsLowOr            4.02   (6.8%)            0.87   (0.5%)    -78.2% ( -80% -  -76%)
HighAndTonsLowNot           4.95   (3.4%)            1.29   (0.9%)    -73.9% ( -75% -  -72%)
HighAndSomeLowOr           14.45   (9.5%)            6.68   (3.7%)    -53.8% ( -61% -  -44%)
HighAndSomeLowNot          14.78   (5.1%)            7.48   (3.9%)    -49.4% ( -55% -  -42%)
LowAndSomeLowOr           316.55   (2.2%)          170.14   (5.6%)    -46.3% ( -52% -  -39%)
LowAndSomeLowNot          283.47   (1.7%)          157.35   (6.0%)    -44.5% ( -51% -  -37%)
LowAndSomeHighOr           39.39   (2.0%)           35.07   (3.1%)    -11.0% ( -15% -   -6%)
LowAndSomeHighNot          53.96   (2.0%)           48.57   (3.8%)    -10.0% ( -15% -   -4%)
LowAndTonsLowNot           17.97   (1.5%)           17.04   (6.0%)     -5.2% ( -12% -    2%)
PKLookup                   97.57   (2.7%)          100.21   (5.2%)      2.7% (  -5% -   10%)
LowAndTonsHighOr            3.59   (1.7%)            3.74   (2.4%)      4.1% (   0% -    8%)
LowAndTonsLowOr            14.71   (1.3%)           15.63   (5.7%)      6.3% (   0% -   13%)
HighAndSomeHighNot          1.84   (1.3%)            2.05   (5.6%)     11.2% (   4% -   18%)
HighAndSomeHighOr           1.93   (2.1%)            2.16   (5.6%)     11.9% (   4% -   20%)
HighAndTonsHighOr           0.05   (1.0%)            0.13  (14.1%)    144.8% ( 128% -  161%)
LowAndTonsHighNot           1.63   (1.9%)            4.95   (7.2%)    204.0% ( 191% -  217%)
HighAndTonsHighNot          0.06   (1.0%)            0.34  (18.2%)    459.6% ( 435% -  483%)


BNS (with bitset) vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndSomeLowOr            7.45  (12.0%)            3.49   (6.6%)    -53.1% ( -64% -  -39%)
HighAndSomeLowNot          10.45   (8.0%)            5.25   (6.8%)    -49.7% ( -59% -  -37%)
LowAndSomeLowOr           310.53   (2.3%)          168.56   (5.8%)    -45.7% ( -52% -  -38%)
LowAndSomeLowNot          292.05   (2.3%)          165.88   (5.7%)    -43.2% ( -50% -  -36%)
HighAndTonsLowNot           5.94   (3.5%)            4.33   (6.8%)    -27.0% ( -36% -  -17%)
HighAndTonsLowOr            5.92   (4.4%)            4.39   (6.0%)    -25.9% ( -34% -  -16%)
LowAndSomeHighNot          53.79   (2.4%)           47.71   (2.8%)    -11.3

[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-06-03 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016491#comment-14016491
 ] 

Da Huang edited comment on LUCENE-4396 at 6/3/14 1:24 PM:
--

A patch based on lucene github mirror commit 
cf10341825ff6bd1662dd48c51926bc51d751ce5.

I use a bitset to skip required docs when scanning optional and prohibited docs. 
The perf. comparison is at the bottom.

Besides, I built a new tasks file to test the perf., and I discovered that BNS 
optimizes the +a -b -c -d ... case a lot when b c d ... hit many docs.

{code}
BNS (without bitset) vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndTonsLowNot           4.29   (2.9%)            1.08   (0.6%)    -74.8% ( -76% -  -73%)
HighAndTonsLowOr            4.87   (6.4%)            1.24   (1.0%)    -74.4% ( -76% -  -71%)
HighAndSomeLowNot           9.03   (5.2%)            4.11   (4.1%)    -54.4% ( -60% -  -47%)
HighAndSomeLowOr           16.21   (9.6%)            7.75   (4.1%)    -52.2% ( -60% -  -42%)
LowAndSomeLowOr           303.28   (2.4%)          183.14   (6.6%)    -39.6% ( -47% -  -31%)
LowAndSomeLowNot          257.24   (1.8%)          157.07   (6.5%)    -38.9% ( -46% -  -31%)
LowAndSomeHighOr           36.78   (1.9%)           33.74   (3.0%)     -8.3% ( -12% -   -3%)
LowAndTonsLowNot           21.28   (2.0%)           19.69   (6.9%)     -7.5% ( -16% -    1%)
LowAndSomeHighNot          34.40   (1.6%)           33.69   (3.2%)     -2.1% (  -6% -    2%)
PKLookup                  100.63   (4.8%)          103.46   (4.7%)      2.8% (  -6% -   12%)
LowAndTonsHighOr            1.26   (1.6%)            1.41   (1.7%)     11.8% (   8% -   15%)
LowAndTonsLowOr            13.66   (0.9%)           15.50   (6.0%)     13.5% (   6% -   20%)
HighAndSomeHighNot          2.65   (1.4%)            3.12   (6.5%)     17.6% (   9% -   25%)
HighAndSomeHighOr           2.21   (2.4%)            2.62   (5.8%)     18.6% (  10% -   27%)
HighAndTonsHighOr           0.07   (0.8%)            0.19  (10.5%)    160.3% ( 147% -  172%)
LowAndTonsHighNot           2.86   (1.6%)           10.24  (18.1%)    257.7% ( 234% -  281%)
HighAndTonsHighNot          0.05   (0.8%)            0.40  (28.2%)    641.8% ( 607% -  676%)
  

BS vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndTonsLowOr            4.02   (6.8%)            0.87   (0.5%)    -78.2% ( -80% -  -76%)
HighAndTonsLowNot           4.95   (3.4%)            1.29   (0.9%)    -73.9% ( -75% -  -72%)
HighAndSomeLowOr           14.45   (9.5%)            6.68   (3.7%)    -53.8% ( -61% -  -44%)
HighAndSomeLowNot          14.78   (5.1%)            7.48   (3.9%)    -49.4% ( -55% -  -42%)
LowAndSomeLowOr           316.55   (2.2%)          170.14   (5.6%)    -46.3% ( -52% -  -39%)
LowAndSomeLowNot          283.47   (1.7%)          157.35   (6.0%)    -44.5% ( -51% -  -37%)
LowAndSomeHighOr           39.39   (2.0%)           35.07   (3.1%)    -11.0% ( -15% -   -6%)
LowAndSomeHighNot          53.96   (2.0%)           48.57   (3.8%)    -10.0% ( -15% -   -4%)
LowAndTonsLowNot           17.97   (1.5%)           17.04   (6.0%)     -5.2% ( -12% -    2%)
PKLookup                   97.57   (2.7%)          100.21   (5.2%)      2.7% (  -5% -   10%)
LowAndTonsHighOr            3.59   (1.7%)            3.74   (2.4%)      4.1% (   0% -    8%)
LowAndTonsLowOr            14.71   (1.3%)           15.63   (5.7%)      6.3% (   0% -   13%)
HighAndSomeHighNot          1.84   (1.3%)            2.05   (5.6%)     11.2% (   4% -   18%)
HighAndSomeHighOr           1.93   (2.1%)            2.16   (5.6%)     11.9% (   4% -   20%)
HighAndTonsHighOr           0.05   (1.0%)            0.13  (14.1%)    144.8% ( 128% -  161%)
LowAndTonsHighNot           1.63   (1.9%)            4.95   (7.2%)    204.0% ( 191% -  217%)
HighAndTonsHighNot          0.06   (1.0%)            0.34  (18.2%)    459.6% ( 435% -  483%)


BNS (with bitset) vs. BS2
Task                QPS baseline   StdDev  QPS my_version   StdDev  Pct diff
HighAndSomeLowOr            7.45  (12.0%)            3.49   (6.6%)    -53.1% ( -64% -  -39%)
HighAndSomeLowNot          10.45   (8.0%)            5.25   (6.8%)    -49.7% ( -59% -  -37%)
LowAndSomeLowOr           310.53   (2.3%)          168.56   (5.8%)    -45.7% ( -52% -  -38%)
LowAndSomeLowNot          292.05   (2.3%)          165.88   (5.7%)    -43.2% ( -50% -  -36%)
HighAndTonsLowNot           5.94   (3.5%)            4.33   (6.8%)    -27.0% ( -36% -  -17%)
HighAndTonsLowOr            5.92   (4.4%)            4.39   (6.0%)    -25.9% ( -34% -  -16%)
LowAndSomeHighNot          53.79

[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-25 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008299#comment-14008299
 ] 

Da Huang edited comment on LUCENE-4396 at 5/25/14 8:44 AM:
---

The patch is based on lucene github mirror commit 
cfb408ff6788e6fea8215098a785d72fb4e95c5b.

The following things have been done:

1. Rename TestBooleanNovelScorer to TestBooleanUnevenly; this test suite now 
tests both BNS and BS when the matching documents are unevenly distributed.

2. Following Robert's advice, I sum scores into a double and cast to float in 
ConjunctionScorer. However, it seems to have little effect. The score 
difference problem still remains.

3. Add a comment to the luceneutil change that accepts score differences 
within a tolerance.

4. Make a new tasks file, which can test AndSomeOR cases.

5. Run luceneutil for BNS vs BS2 and BS vs BS2. The results are shown below.

P.S. BS has the same score difference problem as BNS.
Although there's no BS2 now that the architecture has changed, I still call 
it BS2 here for convenience.

{code}

BNS vs BS2

Task                QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
HighAndTonsLowOr           10.95   (3.5%)                     1.52   (0.3%)    -86.1% ( -86% -  -85%)
HighAndSomeLowOr           29.98   (6.7%)                    11.84   (2.9%)    -60.5% ( -65% -  -54%)
LowAndSomeLowOr           756.81   (1.4%)                   503.21   (2.8%)    -33.5% ( -37% -  -29%)
LowAndSomeHighOr           54.25   (2.1%)                    53.26   (2.1%)     -1.8% (  -5% -    2%)
PKLookup                  241.74   (2.8%)                   241.96   (2.3%)      0.1% (  -4% -    5%)
LowAndTonsLowOr            40.23   (1.2%)                    43.19   (7.2%)      7.4% (   0% -   15%)
LowAndTonsHighOr            2.63   (2.1%)                     2.99   (2.3%)     13.8% (   9% -   18%)
HighAndSomeHighOr           4.99   (1.8%)                     5.86   (4.7%)     17.4% (  10% -   24%)
HighAndTonsHighOr           0.09   (1.5%)                     0.22   (8.1%)    145.4% ( 133% -  157%)


BS vs BS2

Task                QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
HighAndTonsLowOr           16.54   (2.4%)                     3.70   (0.2%)    -77.6% ( -78% -  -76%)
HighAndSomeLowOr           11.95   (8.5%)                     4.29   (0.8%)    -64.1% ( -67% -  -59%)
LowAndSomeLowOr           839.11   (1.9%)                   540.83   (2.5%)    -35.5% ( -39% -  -31%)
LowAndSomeHighOr          149.50   (2.6%)                   136.71   (3.4%)     -8.6% ( -14% -   -2%)
HighAndSomeHighOr           3.72   (1.7%)                     3.51   (1.7%)     -5.6% (  -8% -   -2%)
PKLookup                  240.32   (2.8%)                   238.87   (2.8%)     -0.6% (  -6% -    5%)
LowAndTonsHighOr            4.96   (2.3%)                     5.35   (3.8%)      7.8% (   1% -   14%)
LowAndTonsLowOr            35.28   (1.2%)                    39.00   (5.2%)     10.6% (   4% -   17%)
HighAndTonsHighOr           0.16   (1.1%)                     0.36   (4.0%)    122.6% ( 116% -  129%)
{code}



[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-19 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001595#comment-14001595
]

Da Huang commented on LUCENE-4396:
--

Thanks for your reply.
{quote}OK, maybe add a comment just saying something temporarily commented out
so NovelBS is invoked instead of BS?{quote}
I will add that comment.
{quote} And what exception did luceneutil throw...?{quote}
It just says "hit %s has wrong field/score value %s vs %s", and the perf.
test aborts. The score difference is about 0.01.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-19 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001706#comment-14001706
]

Da Huang commented on LUCENE-4396:
--

{quote} I'm nervous about the luceneutil change, just because I don't want to
encourage complacency on scores being different in general. {quote}
I agree, but it seems that the original perf. tasks file has too few items
per case to reveal score differences when the scorers' calculation orders
differ.
In fact, if I reduce the number of items per case in my tasks file to 3, the
scores are the same as on trunk.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-19 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002604#comment-14002604
 ] 

Da Huang commented on LUCENE-4396:
--

Thanks for your advice, Robert.
Do you mean just changing
{code}
float sum = 0.0f;
{code}
to 
{code}
double sum = 0.0f;
{code} ?

However, I'm not sure this alone will be enough to eliminate the scoring 
differences, as the differences are due to a different calculation order.
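A small experiment makes this doubt concrete. The sketch below is an assumption-laden illustration, not the ConjunctionScorer code: all names are hypothetical and the values are exaggerated so the rounding can be checked by hand. It compares a float accumulator, a double accumulator that is cast once, and a double accumulator whose result is cast to float before being added to another float, as happens when a sub-scorer can only return a float from .score().

```java
/** Hypothetical illustration of float vs. double accumulation; not Lucene code. */
final class DoubleSumSketch {

    /** Accumulate in float, as with "float sum = 0.0f". */
    static float sumAsFloat(float[] scores) {
        float sum = 0.0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    /** Accumulate in double and cast to float once at the end. */
    static float sumAsDouble(float[] scores) {
        double sum = 0.0;
        for (float s : scores) {
            sum += s;
        }
        return (float) sum;
    }

    /** The optional part is summed in double but handed over as a float. */
    static float nested(float required, float[] optional) {
        float optionalScore = sumAsDouble(optional); // precision is lost at this boundary
        return required + optionalScore;
    }

    public static void main(String[] args) {
        float big = 16777216f; // 2^24
        System.out.println(sumAsFloat(new float[] {1f, big, 1f}));  // prints 1.6777216E7
        System.out.println(sumAsDouble(new float[] {1f, big, 1f})); // prints 1.6777218E7
        System.out.println(nested(1f, new float[] {big, 1f}));      // prints 1.6777216E7
    }
}
```

The double accumulator removes the order dependence inside a single loop, but as soon as a partial sum crosses a float-returning boundary the difference can come back.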


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-19 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002620#comment-14002620
 ] 

Da Huang commented on LUCENE-4396:
--

Oh, thanks. I think it's worth a try.

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, luceneutil-score-equal.patch



[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-16 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998736#comment-13998736
 ] 

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions!

{quote}
maybe we could test on fewer terms, for the
Low/HighAndManyLow/High tasks? I think it's more common to have a
handful (3-5 maybe) of terms.
{quote}
With only a few terms, BooleanNovelScorer is slower than BS (about -10%).
However, I still need to generate tasks with fewer terms and rerun them to 
confirm the exact performance difference.

{quote}
 But maybe keep your current category
and rename it to Tons instead of Many?
{quote}
OK, I will do so.

{quote}
Maybe we can improve
the test so that it exercises BS and NBS? E.g., toggle the require
docs in order via a custom collector?
{quote}
Yes, I think that's a good idea.

{quote}
Hmm do we know why the scores changed?
{quote}
Yes, it's because the order of calculation is different. 
BS first adds up the scores of all SHOULD clauses, and then adds their sum to 
the final score.
BNS adds the score of each SHOULD clause to the final score one by one.

{quote}
Are we comparing BS2 to NovelBS?
{quote}
Yes.

{quote}
I think BS and BS2 already have different scores today?
{quote}
Yes. Actually, the score calculation order of BS is the same as that of BNS.

{quote}
but you commented this out in your patch in order to test NBS I
guess?
{quote}
Yes, I did that in order to test BNS. Otherwise, luceneutil would throw an 
exception.

{quote}
Do you have any perf results of BS w/ required clauses (as a
BulkScorer) vs BS2 (what trunk does today)?
{quote}
Hmm, I haven't run that experiment yet. Checking the perf results of BS vs BS2 
is a good idea. I will do that. :) 



 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, luceneutil-score-equal.patch



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-12 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

Added TestBooleanNovelScorer.java, which detects the bug in the second patch.

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, LUCENE-4396.patch, luceneutil-score-equal.patch



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-10 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

The patch is based on the github mirror commit
c1e423e45e6fa9f846ab2c382c0100fd515b4cb1.

The following things are done in this patch:

1. Fix the bug in the last patch. The bug was caused by not setting prev and
next to null before adding an element to a linked list.

2. Refine the code style.

3. Make a small improvement to .advance(). Performance is a little better than
with the last patch, but still worse than trunk when tested with luceneutil.

P.S. The bug in the last patch cannot be detected by ant test, but it can be
found by running a query like +a b in luceneutil. I'm going to add a JUnit test
case that detects the bug, but it may take me a few days.
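
The kind of linked-list bug described above can be sketched as follows. This is only an illustration under assumptions: the Bucket and BucketList classes here are hypothetical and are not the patch's actual code, and the patch's bug may have shown up differently. The sketch shows why a reused node should have its prev and next links cleared before it is appended again.

```java
/** Hypothetical sketch (not the patch's code): reusing nodes in a doubly linked list. */
final class BucketListSketch {

  static final class Bucket {
    final int doc;
    Bucket prev;
    Bucket next;

    Bucket(int doc) {
      this.doc = doc;
    }
  }

  static final class BucketList {
    Bucket head;
    Bucket tail;

    /** Appends a (possibly reused) bucket, clearing links left over from earlier use. */
    void append(Bucket b) {
      b.prev = null;
      b.next = null;
      if (tail == null) {
        head = b;
      } else {
        tail.next = b;
        b.prev = tail;
      }
      tail = b;
    }

    /** Forgets the current contents without touching the buckets themselves. */
    void clear() {
      head = null;
      tail = null;
    }

    /** Counts the buckets reachable from head by following next links. */
    int size() {
      int n = 0;
      for (Bucket b = head; b != null; b = b.next) {
        n++;
      }
      return n;
    }
  }

  /** Fills the list with two buckets, clears it, then reuses only the first bucket. */
  static int sizeAfterReuse() {
    Bucket a = new Bucket(3);
    Bucket b = new Bucket(7);
    BucketList list = new BucketList();
    list.append(a);
    list.append(b); // a.next now points at b
    list.clear();
    list.append(a); // without the reset in append, a.next would still point at b
    return list.size();
  }
}
```

If the two reset lines in append are removed, the traversal after reuse would also reach the stale bucket b, which is the sort of error that only appears once buckets are recycled across windows.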

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-10 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: AndOr.tasks

luceneutil tasks file to test queries like +a b c d e ...

The performance shows as follows.
             Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff
 HighAndManyLowOr          8.50  (3.3%)                     1.72  (0.3%)  -79.8% ( -80% -  -78%)
         PKLookup        239.75  (0.9%)                   239.99  (0.9%)    0.1% (  -1% -    1%)
 LowAndManyHighOr          7.11  (1.4%)                     7.76  (1.4%)    9.1% (   6% -   12%)
  LowAndManyLowOr         33.83  (0.7%)                    41.03  (2.7%)   21.3% (  17% -   24%)
HighAndManyHighOr          0.12  (0.7%)                     0.29  (7.8%)  148.0% ( 138% -  157%)


 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch



[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-10 Thread Da Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Da Huang updated LUCENE-4396:
-

Attachment: luceneutil-score-equal.patch

A patch for luceneutil that allows scores to differ within a tolerance 
range.
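
As a rough idea of what a tolerance check can look like, here is a minimal sketch. It is not the attached luceneutil patch: the class name, the method name and the choice of a relative tolerance are all assumptions made for illustration.

```java
/** Hypothetical sketch (not the luceneutil patch): compare two scores within a tolerance. */
final class ScoreToleranceSketch {

  /**
   * Returns true if a and b differ by at most relTol times the larger magnitude.
   * The scale has a floor of 1 so that scores near zero are compared absolutely.
   */
  static boolean scoresEqual(float a, float b, float relTol) {
    float scale = Math.max(1.0f, Math.max(Math.abs(a), Math.abs(b)));
    return Math.abs(a - b) <= relTol * scale;
  }

  public static void main(String[] args) {
    // Two scores that differ only by float rounding are treated as equal.
    System.out.println(scoresEqual(100000000f, 100000008f, 1e-6f));
    // A real difference is still reported.
    System.out.println(scoresEqual(1.0f, 1.1f, 1e-6f));
  }
}
```

A relative tolerance is used here because rounding error from a different summation order grows with the magnitude of the score.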

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch, luceneutil-score-equal.patch



[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-05-10 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994363#comment-13994363
 ] 

Da Huang edited comment on LUCENE-4396 at 5/11/14 12:56 AM:


luceneutil tasks file to test queries like +a b c d e ...

The performance shows as follows.
||Task||QPS baseline||StdDev||QPS my_modified_version||StdDev||Pct diff||
|HighAndManyLowOr|8.50|(3.3%)|1.72|(0.3%)|-79.8% ( -80% - -78%)|
|PKLookup|239.75|(0.9%)|239.99|(0.9%)|0.1% ( -1% - 1%)|
|LowAndManyHighOr|7.11|(1.4%)|7.76|(1.4%)|9.1% ( 6% - 12%)|
|LowAndManyLowOr|33.83|(0.7%)|41.03|(2.7%)|21.3% ( 17% - 24%)|
|HighAndManyHighOr|0.12|(0.7%)|0.29|(7.8%)|148.0% ( 138% - 157%)|



was (Author: dhuang):
luceneutil tasks file to test queries like +a b c d e ...

The performance shows as follows.
             Task  QPS baseline  StdDev  QPS my_modified_version  StdDev                Pct diff
 HighAndManyLowOr          8.50  (3.3%)                     1.72  (0.3%)  -79.8% ( -80% -  -78%)
         PKLookup        239.75  (0.9%)                   239.99  (0.9%)    0.1% (  -1% -    1%)
 LowAndManyHighOr          7.11  (1.4%)                     7.76  (1.4%)    9.1% (   6% -   12%)
  LowAndManyLowOr         33.83  (0.7%)                    41.03  (2.7%)   21.3% (  17% -   24%)
HighAndManyHighOr          0.12  (0.7%)                     0.29  (7.8%)  148.0% ( 138% -  157%)


 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: AndOr.tasks, LUCENE-4396.patch, LUCENE-4396.patch, 
 LUCENE-4396.patch



[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-04-30 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985254#comment-13985254
]

Da Huang commented on LUCENE-4396:
--

Thanks for your suggestions, Mike. And sorry for my late reply.
{quote}
Hmm, the patch didn't cleanly apply, but I was able to work through
it. I think your dev area is not up to date with trunk?
{quote}
I haven't merged the newest trunk into my branch, because my network account
at school for April has run out and I can't pull the code from GitHub until
1 May. Sorry about that.
{quote}
Small code style things
{quote}
I'm very sorry about the code style. That's my fault.
{quote}
So it looks like BooleanNovelScorer is able to be a Scorer because the
linked-list of visited buckets in one window are guaranteed to be in
docID order, because we first visit the requiredConjunctionScorer's
docs in that window.
{quote}
Yes, you're right.
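
The observation above can be sketched in simplified form. This is not BooleanNovelScorer itself: the class name, the use of sorted int arrays in place of real scorers, the clause count in place of a score, and the 2K-document window size are all assumptions made for illustration. The MUST clause is visited first in each window, so the list of visited buckets is already in docID order when the SHOULD clauses are applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not BooleanNovelScorer itself): MUST-first window scoring. */
final class WindowSketch {
  static final int SIZE = 1 << 11; // assumed 2K-document window
  static final int MASK = SIZE - 1;

  /**
   * must and each shoulds[i] are sorted arrays of distinct, non-negative docIDs.
   * Returns {docID, matchingClauseCount} for every MUST hit, in docID order.
   */
  static List<int[]> collect(int[] must, int[][] shoulds) {
    List<int[]> hits = new ArrayList<>();
    int[] matches = new int[SIZE];   // 0 means the MUST clause did not hit this bucket
    int[] visited = new int[SIZE];   // buckets hit by MUST in this window, in docID order
    int[] shouldPos = new int[shoulds.length];
    int mustPos = 0;
    while (mustPos < must.length) {
      int base = must[mustPos] & ~MASK; // window holding the next MUST doc
      int end = base + SIZE;
      int numVisited = 0;
      // 1) Visit the MUST docs first: this fixes the order of the visited list.
      while (mustPos < must.length && must[mustPos] < end) {
        int slot = must[mustPos++] & MASK;
        matches[slot] = 1;
        visited[numVisited++] = slot;
      }
      // 2) SHOULD docs only update buckets that MUST has already marked.
      for (int i = 0; i < shoulds.length; i++) {
        int[] docs = shoulds[i];
        int p = shouldPos[i];
        while (p < docs.length && docs[p] < base) {
          p++; // a real scorer would skip ahead with advance(base) here
        }
        while (p < docs.length && docs[p] < end) {
          int slot = docs[p++] & MASK;
          if (matches[slot] > 0) {
            matches[slot]++;
          }
        }
        shouldPos[i] = p;
      }
      // 3) Emit in the order MUST visited the buckets, then reset them for reuse.
      for (int k = 0; k < numVisited; k++) {
        int slot = visited[k];
        hits.add(new int[] {base | slot, matches[slot]});
        matches[slot] = 0;
      }
    }
    return hits;
  }
}
```

For example, with must = {1, 5, 3000} and shoulds = {{1, 2, 3000}, {5, 3000, 5000}}, the hits come out as docs 1, 5 and 3000, in that order, with clause counts 2, 2 and 3.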
{quote}
Have you tested performance when the .advance method here isn't called?
Ie, just boolean queries w/ one MUST and one or more SHOULD?
{quote}
No, I haven't. Do you mean the .advance method of the subScorers in
BooleanNovelScorer? If so, I will do that.
If you mean the .advance method of BooleanNovelScorer itself, I find that
confusing, because BooleanNovelScorer is now used whenever there is at least
one MUST clause, whether or not it acts as a top scorer. Therefore, .advance()
of BooleanNovelScorer must be called when BooleanNovelScorer acts as a non-top
scorer.
{quote}
I think the important question here is whether/in what cases the
BooleanNovelScorer approach beats BooleanScorer2 performance?
{quote}
Yes, you're right. But BooleanNovelScorer is not completely finished yet, and
its performance remains to be improved, especially that of its .advance method.
{quote}
I realized LUCENE-4872 is related here, i.e. we should also sometimes
use BooleanScorer for the minShouldMatch1 case.
{quote}
Yes, I noticed that too. :) I think the two issues should be dealt with together.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4396.patch, LUCENE-4396.patch


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-04-27 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

I created a new class named BooleanNovelScorer in this iteration.
This scorer is based on the technique of BooleanScorer, but it can make use of
skip lists while collecting documents.
Moreover, it is a subclass of Scorer, so it can act as a non-top scorer.
However, its performance is currently low, because I have not yet implemented
its .advance() in an efficient way.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4396.patch, LUCENE-4396.patch


[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-04-15 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Da Huang updated LUCENE-4396:
-

Attachment: LUCENE-4396.patch

BooleanScorer now supports MUST clauses (i.e. requiredScorers).
The patch is based on commit 9e87821edeb3e24ca8dedaecf856f6510d61d0d3 on GitHub.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4396.patch


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-04-15 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969418#comment-13969418
 ] 

Da Huang commented on LUCENE-4396:
--

Currently, I pass List<Scorer> requiredScorers to BooleanScorer, and merge 
them into a ConjunctionScorer.

For consistency, I should probably change the argument from List<Scorer> 
requiredScorers to List<BulkScorer> requiredScorers, but that would require 
adding a getScorer method to BulkScorer.
Besides, I removed the static modifier from BooleanScorerCollector and 
BucketTable, because I need to refer to the member requiredNrMatchers of 
BooleanScorer. However, I'm not sure whether removing the static modifier is 
the right choice.

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4396.patch



[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-04-15 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970193#comment-13970193
]

Da Huang commented on LUCENE-4396:
--

These suggestions are very helpful. Thank you. :)

Accumulating a mustClauseCountMatches counter would be inefficient, as it
cannot make use of skip lists.
How about implementing getScorer to return null for BulkScorer, while returning
the scorer for DefaultBulkScorer?
I'm not sure whether passing List<BulkScorer> instead of List<Scorer> is
really necessary.
So I think this issue should probably be set aside for now.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4396.patch


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-21 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942978#comment-13942978
]

Da Huang commented on LUCENE-4396:
--

Oh, that's very true. With the current design, using advance will hurt
in some cases.

However, I think such cases could be handled by building a skip list over
MUST-all-hit bulks and skipping those bulks when scanning SHOULD, but I haven't
worked this idea out clearly yet. So making BooleanBulkScorer is still
necessary for now.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-21 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942978#comment-13942978
]

Da Huang edited comment on LUCENE-4396 at 3/21/14 11:19 AM:

Oh, that's very true. With the current design, using advance will hurt
in such cases.

However, I think such cases could be handled by building a skip list over
MUST-all-hit bulks and skipping those bulks when scanning SHOULD, but
I haven't worked this idea out clearly yet. So making BooleanBulkScorer is
still necessary for now.

was (Author: dhuang):
Oh, that's very true. According to the current design, using advance will hurt
is some cases.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-20 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941775#comment-13941775
]

Da Huang commented on LUCENE-4396:
--

Sorry for my late reply. I have been thinking about the new code/design on the
trunk these days.

The new code splits BulkScorer out from Scorer, and it is necessary to create a
new BooleanScorer (a Scorer), just as you said. I'm afraid that we do have to
use Scorer instead as the subScorer in the new BooleanScorer. And yes:
BooleanBulkScorer should not be embedded, as its docIDs are out of order. My
idea is to keep BooleanBulkScorer supporting only the no-MUST-clause case, and
let the new BooleanScorer deal with the case where there is at least one MUST
clause. I think this is one of the best ways to stay compatible with the
current design.

Besides, I'm afraid that the name BulkScorer may be confusing. The new
BooleanScorer is also implemented by scoring a range of documents at once, but
it can actually act as a sub-scorer.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-20 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942577#comment-13942577
 ] 

Da Huang commented on LUCENE-4396:
--

I'm afraid that if BooleanBulkScorer also handled MUST, it couldn't make use of 
.advance(), as its subScorers are BulkScorers, on which .advance() cannot be called.

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-20 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942684#comment-13942684
 ] 

Da Huang commented on LUCENE-4396:
--

A new iteration of the proposal has just been submitted. It adds a 
"Supplementary Notes" section describing how to fit my design to the new 
design on the current Lucene trunk, such as renaming BooleanScorer to 
BooleanBulkScorer and creating a new BooleanScorer that extends Scorer.

 BooleanScorer should sometimes be used for MUST clauses
 ---

 Key: LUCENE-4396
 URL: https://issues.apache.org/jira/browse/LUCENE-4396
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-19 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940342#comment-13940342
]

Da Huang commented on LUCENE-4396:
--

Hi, Mike.

I have just finished revising my proposal. I'm not so sure about this statement
in the description on this page: "unless the MUST clauses have very low hit
count compared to the other clauses, that BooleanScorer would perform better
than BooleanScorer2."

In my opinion, even when MUST clauses have very low hit count compared to the
other clauses, BooleanScorer is likely to perform better than BooleanScorer2,
because calling .advance() when dealing with SHOULD clauses can skip
as many documents as BooleanScorer2 does.

The relevant ideas are described in the section "Improve the Rule for Choosing
Scorer". As they are not entirely consistent with the description on this page,
I'm not sure whether my idea makes sense.

BooleanScorer should sometimes be used for MUST clauses
---

Key: LUCENE-4396
URL: https://issues.apache.org/jira/browse/LUCENE-4396
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless


[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-16 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937412#comment-13937412
]

Da Huang edited comment on LUCENE-4396 at 3/17/14 2:14 AM:
---

I have been revising and polishing my proposal these days, and I have
discovered an interesting thing: if BooleanScorer supports required scorers in
the way I have proposed, docIDs would be in ascending order in the bucket
table. I think this would let BooleanScorer act as a Not-Top Scorer, since
.advance(), .docID(), .nextDoc() etc. could be implemented. However, I'm not
sure how well it would perform as a Not-Top Scorer, because whenever
.nextDoc() or .advance() is called, BooleanScorer may compute a whole 2K
window, and not all of that window's data may be useful.
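
The observation can be sketched roughly as follows. This is a toy Python model
under stated assumptions, not the proposed patch or Lucene's BooleanScorer:
WindowedScorer, windows_filled and the window size of 2048 are hypothetical,
and the score is just the number of matching clauses. Each window is scored
into a bucket table, only slots hit by the required clause are kept, and
scanning the slots in index order yields ascending docIDs, so next_doc() and
advance() can be layered on top.

```python
import bisect

WINDOW = 2048             # the "2K window"; the exact size is an assumption
NO_MORE_DOCS = 2**31 - 1  # sentinel meaning "iterator exhausted"


class WindowedScorer:
    """Toy window-at-a-time scorer with one required (MUST) clause."""

    def __init__(self, must_docs, should_lists):
        self.must = sorted(must_docs)
        self.shoulds = [sorted(s) for s in should_lists]
        self.max_doc = self.must[-1] if self.must else -1
        self.base = -1           # start of the window currently buffered
        self.buffer = []         # (docID, score) pairs, ascending by docID
        self.doc = -1
        self.score = 0
        self.windows_filled = 0  # bookkeeping for the illustration only

    @staticmethod
    def _slice(docs, base):
        lo = bisect.bisect_left(docs, base)
        hi = bisect.bisect_left(docs, base + WINDOW)
        return docs[lo:hi]

    def _fill_window(self, base):
        """Score every doc in [base, base + WINDOW) into the bucket table."""
        self.windows_filled += 1
        required = [False] * WINDOW
        count = [0] * WINDOW
        for d in self._slice(self.must, base):
            required[d - base] = True
            count[d - base] += 1
        for sub in self.shoulds:
            for d in self._slice(sub, base):
                count[d - base] += 1
        # Scanning slots in index order gives docIDs in ascending order.
        self.buffer = [(base + i, count[i]) for i in range(WINDOW) if required[i]]
        self.base = base

    def advance(self, target):
        base = (target // WINDOW) * WINDOW
        while base <= self.max_doc:
            if base != self.base:
                self._fill_window(base)  # the whole window is computed
            for doc, score in self.buffer:
                if doc >= target:
                    self.doc, self.score = doc, score
                    return doc
            base += WINDOW
        self.doc = NO_MORE_DOCS
        return self.doc

    def next_doc(self):
        return self.advance(self.doc + 1)
```

In this sketch a single advance() into a new window fills the entire window,
even if the caller only needs one document from it. That is the possible cost
mentioned above.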

I hope I have made my idea clear.

was (Author: dhuang):
I'm revising and polishing my proposal these days, and I have discovered a
interesting thing. That is: if BooleanScorer supports required scorers in the
way I have proposed, docIDs would be in acsending order in the bucket table. I
think this can make BooleanScorer be a Not-Top Scorer, as .advance() .docID()
.nextDoc() etc. can be implemented. However, I'm not sure how it would affect
the performance when it acts as a Not-Top Scorer. This is because when
.nextDoc() or .advance() is called, BooleanScorer may calculate a 2K window
whose data may not be all useful.

I hope I have made my idea clear.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-16 Thread Da Huang (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937412#comment-13937412
]

Da Huang commented on LUCENE-4396:
--

I have been revising and polishing my proposal these days, and I have
discovered an interesting thing: if BooleanScorer supports required scorers in
the way I have proposed, docIDs would be in ascending order in the bucket
table. I think this would let BooleanScorer act as a Not-Top Scorer, since
.advance(), .docID(), .nextDoc() etc. could be implemented. However, I'm not
sure how well it would perform as a Not-Top Scorer, because whenever
.nextDoc() or .advance() is called, BooleanScorer may compute a whole 2K
window, and not all of that window's data may be useful.

I hope I have made my idea clear.


[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

2014-03-14 Thread Da Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935929#comment-13935929
 ] 

Da Huang commented on LUCENE-4396:
--

I have just set the GSoC proposal's visibility to Public. The public URL is:
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/dhuang/5629499534213120


Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Da Huang

Hi, Mike.

You're right. After looking at the comments on LUCENE-1518, I see that my idea
about it has many flaws. Sorry about that.

So I have looked through some of the other suggestions you gave me, to see
whether relevant comments can be found in JIRA.

I think I have an idea for LUCENE-4396: "BooleanScorer should sometimes be
used for MUST clauses".
Can we rewrite the query to make the problem easier? Taking the query
"+a b c +e +f" as an example, maybe we can turn it into "(+a +e +f) b c",
which has only one MUST clause. Would it then be easier to judge which scorer
to use?
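
The rewrite can be sketched in a few lines. This is an illustrative Python
model only: the (occur, query) tuples and the names group_must_clauses and
to_string are hypothetical and are not Lucene's BooleanQuery API. All MUST
clauses are collected into one nested conjunction, which then becomes the
single MUST clause of the outer query (printed here with an explicit "+").

```python
def group_must_clauses(clauses):
    """Group all MUST clauses into one nested MUST clause.

    clauses is a list of (occur, query) pairs, where occur is "MUST",
    "SHOULD" or "MUST_NOT" and query is a term string or a nested list.
    """
    must = [c for c in clauses if c[0] == "MUST"]
    rest = [c for c in clauses if c[0] != "MUST"]
    if len(must) <= 1:
        return list(clauses)  # nothing to group
    return [("MUST", must)] + rest


def to_string(clauses):
    """Render clauses in the +a b -c style used in this thread."""
    prefix = {"MUST": "+", "SHOULD": "", "MUST_NOT": "-"}
    parts = []
    for occur, query in clauses:
        text = "(" + to_string(query) + ")" if isinstance(query, list) else query
        parts.append(prefix[occur] + text)
    return " ".join(parts)


query = [("MUST", "a"), ("SHOULD", "b"), ("SHOULD", "c"),
         ("MUST", "e"), ("MUST", "f")]
print(to_string(group_must_clauses(query)))
```

The sketch only restructures the clauses. Whether scores stay identical after
such a rewrite is a separate question that it does not address.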

Besides, it seems that the suggestion "we should pass a needsScorers boolean
up-front to Weight.scorer" is not in JIRA. It sounds like it could be done by
adjusting the arguments and return values of some class methods so that
needsScorers is passed through, but I'm not sure.

Lastly, I recently noticed something strange in the heap-related code. A heap
is implemented several times over in trunk, and a PriorityQueue is also
implemented in the package org.apache.lucene.util.
As far as I remember, Java already provides a PriorityQueue. Why not use that?


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

2014-03-09 Thread Da Huang

Thanks a lot. That's very helpful.

I think you understood exactly what I meant about LUCENE-4396.
By grouping the MUST clauses together, the conjunctive part of the query can
be handled on its own in a simple way. The original query would then have
no more than one MUST clause. I think in this situation it's much
easier to judge whether to use BooleanScorer or BooleanScorer2. :)


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

Re: [GoSC] I'm interested in LUCENE-3333

2014-03-08 Thread Da Huang

Hello, Mike.

I have spent some time considering the suggestions in your last mail. I find
that I'm interested in the suggestion "Filter and Query should be more
'combined'".

In my opinion, to implement this suggestion, a new class FilterQuery,
which is a subclass of Query, should be created. Once FilterQuery is
implemented, it can be the query element of a BooleanClause, and a
BooleanQuery can naturally add a Filter as a BooleanClause. I think
one of the most important things is how to deal with the scores, as a Filter
does not contribute anything to the score.
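
A rough sketch of the idea, under loud assumptions: this is a toy Python
model, not Lucene code, and TermQuery, FilterQuery and search as written here
are simplified stand-ins invented for illustration. The wrapped filter takes
part in matching like any other required clause but always contributes a
score of zero.

```python
class TermQuery:
    """Toy scoring clause: maps docID -> score (hypothetical stand-in)."""

    def __init__(self, postings):
        self.postings = dict(postings)

    def matches(self, doc):
        return doc in self.postings

    def score(self, doc):
        return self.postings[doc]


class FilterQuery:
    """Toy wrapper that lets a filter act as a clause with no score."""

    def __init__(self, allowed_docs):
        self.allowed = set(allowed_docs)

    def matches(self, doc):
        return doc in self.allowed

    def score(self, doc):
        return 0.0  # a filter restricts the result set but adds no score


def search(must_clauses, max_doc):
    """Return {docID: score} for docs matched by every required clause."""
    hits = {}
    for doc in range(max_doc):
        if all(c.matches(doc) for c in must_clauses):
            hits[doc] = sum(c.score(doc) for c in must_clauses)
    return hits
```

In this model, adding the FilterQuery to the clause list narrows the hits
while leaving the scores of the surviving documents unchanged.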

The above is my intuitive idea about this suggestion. Do you think it makes
sense? I hope I have made my idea clear.


Thanks,
Da Huang


-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China

[GoSC] I'm interested in LUCENE-3333

2014-03-07 Thread Da Huang

Hello, everyone,

My name is Da Huang. I'm studying for my master's degree in Computer Science
at Peking University. I have been using Lucene for about half a year. It's
so elegant that I hope to have a chance to contribute some code to it.

Therefore, I have been scanning the JIRA GSoC 2014 ideas page for Lucene
for several days. I find LUCENE-3333, "Specialize DisjunctionScorer if all
clauses are TermQueries", the most suitable for me. I have spent some time
reading the relevant code and the issue LUCENE-3328, from which LUCENE-3333
was spun off. The following questions confuse me.

1) I have checked out the code from
http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk, but I
couldn't find the relevant code of the fixed issue LUCENE-3328. It seems
that the patch attached to the page is not on trunk. Why?

2) My intuitive idea for solving this issue is to add a class
DisjunctionTermScorer that handles the case where all clauses are
TermQueries, and then to decide whether to use DisjunctionTermScorer in the
method 'scorer' of class BooleanQuery. Is this idea right?

Those are my questions about LUCENE-3333. Besides, I would like to
propose the following issue, which is about the QueryParser.

When we use QueryParser to parse a query string like "science AND
(engineering AND technology)", the generated query is
"+science (+engineering +technology)". I think searching would be more
efficient if the final query were "+science +engineering +technology". My
idea is to flatten cascaded ANDs and cascaded ORs. Do you agree? I hope
I have made my idea clear.
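
The flattening step could look roughly like this. It is a hedged Python sketch
on a toy query tree, not QueryParser code: the (operator, children) tuples and
the function name flatten are assumptions made for illustration. A child node
is merged into its parent only when both use the same operator.

```python
def flatten(node):
    """Flatten nested AND-in-AND and OR-in-OR nodes of a toy query tree.

    A node is either a term (str) or a pair (op, children) with op in
    {"AND", "OR"} and children a list of nodes.
    """
    if isinstance(node, str):
        return node
    op, children = node
    flat = []
    for child in children:
        child = flatten(child)
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1])  # same operator: lift the grandchildren
        else:
            flat.append(child)     # term or different operator: keep as is
    return (op, flat)


nested = ("AND", ["science", ("AND", ["engineering", "technology"])])
print(flatten(nested))
```

An OR nested inside an AND is left alone, since merging it would change which
documents match. The sketch says nothing about how scores are affected by the
flattening.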

Thanks,
Da Huang




-- 
黄达（Da Huang）
Team of Search Engine & Web Mining
School of Electronic Engineering & Computer Science
Peking University, Beijing, 100871, P.R.China
