[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Da Huang updated LUCENE-4396: - Attachment: tasks.cpp And.tasks A new tasks file, and the program which can generate it. In order to generate the tasks file with the program, you can run: {code} g++ tasks.cpp -std=c++0x -o tasks ./tasks wikimedium.10M.nostopwords.tasks And.tasks {code} BooleanScorer should sometimes be used for MUST clauses --- Key: LUCENE-4396 URL: https://issues.apache.org/jira/browse/LUCENE-4396 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Attachments: And.tasks, And.tasks, And.tasks, AndOr.tasks, AndOr.tasks, LUCENE-4396-simple.patch, LUCENE-4396-simple.patch, LUCENE-4396-simple.patch, LUCENE-4396-simple.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, SIZE.perf, all.perf, luceneutil-score-equal.patch, luceneutil-score-equal.patch, merge-simple.perf, merge-simple.png, merge.perf, merge.png, perf.png, stat.cpp, stat.cpp, tasks.cpp, tasks.cpp Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT. If there is one or more MUST clauses we always use BooleanScorer2. But I suspect that unless the MUST clauses have very low hit count compared to the other clauses, that BooleanScorer would perform better than BooleanScorer2. BooleanScorer still has some vestiges from when it used to handle MUST so it shouldn't be hard to bring back this capability ... I think the challenging part might be the heuristics on when to use which (likely we would have to use firstDocID as proxy for total hit count). 
Likely we should also have BooleanScorer sometimes use .advance() on the subs in this case, eg if suddenly the MUST clause skips 100 docs then you want to .advance() all the SHOULD clauses. I won't have near term time to work on this so feel free to take it if you are inspired! -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
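The .advance() idea in the issue description can be illustrated with a toy model. The sketch below is not Lucene code: `PostingIterator` and `must_with_should` are hypothetical names, and doc IDs are plain Python integers. It only shows why jumping the SHOULD iterators to the MUST clause's current doc can save work when the MUST clause skips far ahead.

```python
from bisect import bisect_left

NO_MORE_DOCS = float("inf")


class PostingIterator:
    """Toy iterator over sorted doc IDs (hypothetical; not a Lucene class)."""

    def __init__(self, docs):
        self.docs = sorted(docs)
        self.pos = -1      # -1 means "not positioned yet"
        self.calls = 0     # number of next_doc()/advance() calls made

    def doc(self):
        if self.pos < 0:
            return -1
        if self.pos >= len(self.docs):
            return NO_MORE_DOCS
        return self.docs[self.pos]

    def next_doc(self):
        self.pos += 1
        self.calls += 1
        return self.doc()

    def advance(self, target):
        """Jump to the first doc >= target (binary search from the current position)."""
        self.pos = bisect_left(self.docs, target, max(self.pos, 0))
        self.calls += 1
        return self.doc()


def must_with_should(must, shoulds):
    """Return {doc: number of matching SHOULD clauses} for every doc the MUST clause hits."""
    hits = {}
    doc = must.next_doc()
    while doc != NO_MORE_DOCS:
        matched = 0
        for should in shoulds:
            if should.doc() < doc:
                should.advance(doc)   # skip everything the MUST clause jumped over
            if should.doc() == doc:
                matched += 1
        hits[doc] = matched
        doc = must.next_doc()
    return hits
```

With a sparse MUST clause such as `[5, 200, 305]` and a dense SHOULD clause such as `range(0, 400, 5)`, the dense iterator is positioned only three times instead of being stepped through all 80 of its docs.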
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396-simple.patch

Oh, that's very unfortunate. It seems the only choice is to bring BS back.

In this patch, I've restored BS. Hopefully this gives better perf.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100255#comment-14100255 ]

Da Huang commented on LUCENE-4396:

I've tested again with exactly the same setup as Mike's. Here are the results.
{code}
Task                  QPS baseline   StdDev  QPS my_version   StdDev Pct diff
HighSpanNear                  1.05   (2.1%)            1.04   (2.1%)    -1.6% ( -5% -   2%)
HighSloppyPhrase              3.83   (5.3%)            3.78   (4.9%)    -1.3% (-10% -   9%)
LowTerm                      78.04   (4.5%)           77.13   (4.5%)    -1.2% ( -9% -   8%)
MedSpanNear                   2.89   (3.9%)            2.86   (3.3%)    -1.1% ( -8% -   6%)
LowSpanNear                   5.91   (4.9%)            5.84   (4.2%)    -1.1% ( -9% -   8%)
HighTerm                      8.02  (12.1%)            7.94  (11.4%)    -1.0% (-21% -  25%)
AndHighHigh                   9.84   (1.9%)            9.74   (2.4%)    -1.0% ( -5% -   3%)
MedTerm                      30.63   (4.7%)           30.35   (5.1%)    -0.9% (-10% -   9%)
LowSloppyPhrase               5.83   (4.4%)            5.79   (4.5%)    -0.7% ( -9% -   8%)
MedSloppyPhrase              16.86   (4.5%)           16.75   (4.3%)    -0.6% ( -9% -   8%)
OrHighMed                     7.57   (4.5%)            7.55   (4.1%)    -0.3% ( -8% -   8%)
OrNotHighLow                  7.87   (5.3%)            7.84   (5.3%)    -0.3% (-10% -  10%)
AndHighMed                   25.10   (3.1%)           25.05   (3.7%)    -0.2% ( -6% -   6%)
Fuzzy2                       10.80   (2.7%)           10.78   (2.9%)    -0.1% ( -5% -   5%)
OrHighHigh                    8.75   (4.4%)            8.74   (4.1%)    -0.1% ( -8% -   8%)
OrHighNotMed                  7.33   (4.4%)            7.33   (4.0%)    -0.1% ( -8% -   8%)
OrNotHighHigh                 4.84   (5.1%)            4.84   (5.0%)    -0.1% ( -9% -  10%)
OrHighLow                     6.67   (4.6%)            6.66   (4.5%)    -0.1% ( -8% -   9%)
OrNotHighMed                  2.90   (5.2%)            2.89   (5.2%)    -0.1% (-10% -  10%)
OrHighNotHigh                 2.32   (4.9%)            2.32   (4.6%)    -0.0% ( -9% -   9%)
Fuzzy1                       20.35   (3.1%)           20.38   (3.4%)     0.1% ( -6% -   6%)
OrHighNotLow                 13.54   (4.5%)           13.56   (4.2%)     0.2% ( -8% -   9%)
MedPhrase                    11.75   (3.2%)           11.78   (2.4%)     0.2% ( -5% -   5%)
LowPhrase                     6.08   (2.9%)            6.09   (2.7%)     0.2% ( -5% -   6%)
HighPhrase                   13.25   (3.8%)           13.29   (3.4%)     0.3% ( -6% -   7%)
Prefix3                      19.78   (3.2%)           19.85   (3.9%)     0.4% ( -6% -   7%)
Respell                      15.13   (3.1%)           15.19   (3.7%)     0.4% ( -6% -   7%)
Wildcard                      8.82   (3.3%)            8.89   (4.9%)     0.8% ( -7% -   9%)
IntNRQ                        0.85   (4.2%)            0.86   (6.0%)     1.3% ( -8% -  12%)
AndHighLow                  172.85   (4.9%)          175.57   (4.7%)     1.6% ( -7% -  11%)
{code}
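One rough way to read a table like the one above is to check whether each row's reported range straddles zero. The sketch below parses a few rows copied from the table and applies that check. The row layout assumed by the regular expression (`Task QPS (StdDev) QPS (StdDev) Pct (lo - hi)`) and the helper names `parse_row` and `within_noise` are assumptions made for illustration; this is not part of luceneutil, and it is a reading aid rather than a significance test.

```python
import re

# Assumed layout: Task  QPS (StdDev%)  QPS (StdDev%)  Pct%  ( lo% - hi% )
ROW = re.compile(
    r"(?P<task>\S+)\s+(?P<base>[\d.]+)\s+\((?P<base_sd>[\d.]+)%\)\s+"
    r"(?P<mine>[\d.]+)\s+\((?P<mine_sd>[\d.]+)%\)\s+"
    r"(?P<pct>-?[\d.]+)%\s+\(\s*(?P<lo>-?\d+)%\s+-\s+(?P<hi>-?\d+)%\)"
)


def parse_row(line):
    """Parse one benchmark row into a dict; raise ValueError if it does not match."""
    m = ROW.search(line)
    if m is None:
        raise ValueError("unrecognized row: %r" % line)
    d = m.groupdict()
    return {
        "task": d["task"],
        "base_qps": float(d["base"]),
        "my_qps": float(d["mine"]),
        "pct": float(d["pct"]),
        "lo": int(d["lo"]),
        "hi": int(d["hi"]),
    }


def within_noise(row):
    """True when the reported range includes zero, i.e. no clear win or loss."""
    return row["lo"] <= 0 <= row["hi"]


rows = [
    parse_row("HighSpanNear 1.05 (2.1%) 1.04 (2.1%) -1.6% ( -5% - 2%)"),
    parse_row("HighTerm 8.02 (12.1%) 7.94 (11.4%) -1.0% ( -21% - 25%)"),
    parse_row("AndHighLow 172.85 (4.9%) 175.57 (4.7%) 1.6% ( -7% - 11%)"),
]
unclear = [r["task"] for r in rows if within_noise(r)]
```

For the three sample rows, every range includes zero, so `unclear` lists all three tasks.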
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396-simple.patch

I have rebased the patch onto the recent git mirror commit 9069570eba29b3270bf5232f4fc8f6a156ff66d1 .

I've also optimized BooleanScorerCollector so that the coord is calculated in the constructor.
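The coord change mentioned above (computing coord in the constructor) can be pictured as building a lookup table once, instead of calling the coord function for every collected hit. The sketch below is a toy model, not the patch itself: `ToyCollector` is a hypothetical class, and the default coord of `overlap / max_overlap` is an assumption about the similarity in use.

```python
class ToyCollector:
    """Hypothetical stand-in for the collector discussed above; not Lucene code."""

    def __init__(self, max_coord, coord=lambda overlap, max_overlap: overlap / max_overlap):
        # Build the table once, so collect() does a list lookup instead of a coord() call.
        # max_coord is assumed to be >= 1.
        self.coord_factors = [coord(i, max_coord) for i in range(max_coord + 1)]
        self.scores = {}

    def collect(self, doc, raw_score, matched_clauses):
        """Record the coord-adjusted score for one hit."""
        self.scores[doc] = raw_score * self.coord_factors[matched_clauses]
```

For example, with `max_coord=4` the table is `[0.0, 0.25, 0.5, 0.75, 1.0]`, and a hit that matched 3 clauses with raw score 2.0 is stored as 1.5.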
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: merge-simple.png, LUCENE-4396-simple.patch, merge-simple.perf

This is a patch based on git mirror commit 67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I simplified the logic for choosing scorers. I think the logic is quite simple and intuitive now.

[^merge-simple.perf] is its raw performance data. You can also refer to the following figures.

!merge-simple.png!
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098565#comment-14098565 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions, Mike!

{quote}
I'm worried about how BooleanWeight.bulkScorer first pulls BulkScorer for the clauses, and then sometimes also pulls Scorer; pulling a Scorer is not that cheap an operation in general.
{quote}
My current plan is to break out of the first iteration over the weights as soon as it reaches a required scorer. That way, I'm sure the number of times it pulls scorers is exactly the same as on trunk.

{quote}
Maybe if we added .cost() to bulk scorer we could avoid that?
{quote}
I don't think so. When the logic chooses DAAT rather than BS, it has to fall back to super.bulkScorer(), which pulls all the scorers again.

{quote}
Or maybe we could look at the BulkScorer, and if it's a DefaultBulkScorer, just ask it for the Scorer it wrapped?
{quote}
That could get awkward when it's not a DefaultBulkScorer, but I'm not sure. I'll give it a try.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396-simple.patch

In this patch, I made BS's classes static and adjusted the scorer-choosing logic so that the number of times it pulls scorers is exactly the same as on trunk.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096845#comment-14096845 ]

Da Huang commented on LUCENE-4396:

{quote}
it looks like we are down to one new added scorer
{quote}
Yes, we have only one added scorer now.

{quote}
I wonder if we can somehow simplify that decision process?
{quote}
Yeah, I agree. The current choosing logic is indeed too tricky. I'm going to look for a simpler and more intuitive way. I think the perf. figures shown in perf.png are still the most important reference.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I added test cases to make sure the scores calculated by BS, BAS and DAAT are the same. I have also deleted the unused logic and added comments.
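A score-equality test of the kind described above can be sketched with a toy model. The code below is not the test from the patch: the scoring functions and `assert_same_scores` are hypothetical, and each clause is modeled as a plain `{doc: score}` dict. It illustrates one common reason two scoring strategies can disagree in the last bits, namely that floating-point addition depends on summation order; the thread does not say whether that is the cause of the difference on trunk.

```python
import math
from collections import defaultdict


def score_taat(clauses):
    """Term-at-a-time: walk each clause fully, accumulating per-doc scores."""
    acc = defaultdict(float)
    for postings in clauses:
        for doc, score in postings.items():
            acc[doc] += score
    return dict(acc)


def score_daat(clauses):
    """Document-at-a-time: visit docs in order and sum the clauses for each doc.

    The clauses are summed in reverse order here on purpose, to mimic a
    strategy that happens to visit sub-scorers in a different order.
    """
    docs = sorted(set().union(*(p.keys() for p in clauses)))
    return {doc: sum(p[doc] for p in reversed(clauses) if doc in p) for doc in docs}


def assert_same_scores(a, b, rel_tol=1e-6):
    """Check that both strategies hit the same docs with (nearly) equal scores."""
    assert a.keys() == b.keys()
    for doc in a:
        assert math.isclose(a[doc], b[doc], rel_tol=rel_tol), (doc, a[doc], b[doc])


clauses = [{1: 0.1, 4: 1.0}, {1: 0.2}, {1: 0.3, 9: 2.5}]
taat = score_taat(clauses)
daat = score_daat(clauses)
assert_same_scores(taat, daat)   # passes with a tolerance
```

Comparing with a tolerance rather than `==` is the design choice to note: for doc 1 the two sums are 0.1 + 0.2 + 0.3 and 0.3 + 0.2 + 0.1, which differ in the last bit in double precision.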
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092678#comment-14092678 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions, Mike!

{quote}
Is it possible to make a test case showing what the bug was, and that's fixed (and stays fixed)?
{quote}
The current test cases can show the bug if you uncomment this line:
{code}
// scorerOrClass = BooleanArrayScorer.class;
{code}

{quote}
Also, do we have a test case that fails if DAAT and TAAT scoring differs (as it does on trunk today)?
{quote}
No. I'll add that test case in the next patch.

{quote}
Can you add a comment to that part in the code, linking to this issue and explaining the motivation behind it?
{quote}
Sure.

{quote}
Can I commit TestBooleanUnevenly to trunk today? Seems like there's no reason to wait...
{quote}
Yes, sure.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit d707f783ab068b70752a3f9cfdc0dabb7f4fbadf .

In this patch, I tried to fix the .getChildren() problem in BAS and BLS.

I tried to make .bulkScorer() choose DAAT when scoreDocsInOrder is true. However, I discovered that I would have to copy the scorer-choosing logic into .scoreDocsOutOfOrder() to make things right. I also tried to implement .getChildren() for BAS and BLS, but the TAAT strategy exhausts the scorers at the beginning. In the end, I just throw UnsupportedOperationException in BAS.getChildren() and BLS.getChildren().

I have also run more tests to make sure everything is right. As you can see, the performance of the HighAnd.\*Low.\* cases shown in merge.png is not good. Therefore, I ran the HighAnd.\*Low.\* cases with luceneutil's pattern filter, and the result is as follows.
{code}
Task                  QPS baseline   StdDev  QPS my_version   StdDev Pct diff
HighAnd6LowOr                 9.44   (6.4%)            9.19   (4.8%)    -2.6% (-12% -   9%)
HighAnd5LowOr                 9.00   (8.8%)            8.85   (7.4%)    -1.6% (-16% -  16%)
HighAnd3LowOr                11.89   (8.9%)           11.71   (7.8%)    -1.6% (-16% -  16%)
HighAnd4LowOr                10.78   (7.4%)           10.61   (6.3%)    -1.5% (-14% -  13%)
HighAnd7LowOr                 9.08   (7.2%)            8.94   (5.8%)    -1.5% (-13% -  12%)
HighAnd8LowOr                 6.32   (8.6%)            6.23   (6.9%)    -1.4% (-15% -  15%)
HighAnd9LowOr                 5.71   (5.7%)            5.65   (4.5%)    -1.1% (-10% -   9%)
PKLookup                     98.95   (4.5%)           98.38   (2.4%)    -0.6% ( -7% -   6%)
HighAnd9LowNot                7.49   (3.7%)            7.46   (3.2%)    -0.4% ( -7% -   6%)
HighAnd4LowNot               10.33   (6.4%)           10.31   (6.1%)    -0.2% (-11% -  13%)
HighAnd8LowNot                6.69   (5.3%)            6.70   (4.9%)     0.1% ( -9% -  10%)
HighAnd7LowNot                6.82   (5.1%)            6.84   (5.0%)     0.3% ( -9% -  10%)
HighAnd6LowNot                9.45   (5.5%)            9.48   (4.7%)     0.3% ( -9% -  11%)
HighAnd3LowNot               10.80   (6.7%)           10.87   (6.1%)     0.6% (-11% -  14%)
HighAnd5LowNot                4.28   (7.4%)            4.32   (7.1%)     1.0% (-12% -  16%)
{code}
Everything looks right.
I have also run tests for more complicated tasks.
{code}
Task                  QPS baseline   StdDev  QPS my_version   StdDev Pct diff
LowAnd6LowOr6LowNot          31.59   (1.0%)           28.52   (2.4%)    -9.7% (-12% -  -6%)
HighAnd6LowOr6LowNot          6.10   (2.7%)            5.76   (4.0%)    -5.6% (-11% -   1%)
MedAnd6LowOr6LowNot           7.33   (2.3%)            7.03   (3.1%)    -4.0% ( -9% -   1%)
HighAnd6MedOr6LowNot          3.51   (1.5%)            3.49   (2.6%)    -0.6% ( -4% -   3%)
PKLookup                     95.99   (5.1%)           95.48   (4.9%)    -0.5% (-10% -   9%)
HighAnd6MedOr6MedNot          1.96   (1.3%)            1.97   (2.5%)     0.4% ( -3% -   4%)
MedAnd6MedOr6MedNot           2.34   (1.2%)            2.35   (2.3%)     0.5% ( -2% -   4%)
HighAnd6LowOr6HighNot         1.31   (1.1%)            1.33   (2.4%)     0.9% ( -2% -   4%)
HighAnd6LowOr6MedNot          3.08   (1.5%)            3.12   (2.7%)     1.2% ( -2% -   5%)
MedAnd6LowOr6MedNot           3.72   (1.4%)            3.89   (2.6%)     4.8% (  0% -   8%)
HighAnd6MedOr6HighNot         1.40   (1.0%)            1.53   (2.4%)     9.3% (  5% -  12%)
LowAnd6LowOr6MedNot           9.23   (2.1%)           10.19   (2.7%)    10.4% (  5% -  15%)
LowAnd6LowOr6HighNot          6.04   (2.5%)            6.74   (2.9%)    11.6% (  6% -  17%)
LowAnd6HighOr6HighNot         4.15   (3.4%)            4.72   (4.2%)    13.8% (  5% -  22%)
MedAnd6MedOr6HighNot          1.65   (1.2%)            1.91   (2.2%)    15.7% ( 12% -  19%)
MedAnd6LowOr6HighNot          2.42   (1.7%)            2.80   (2.7%)    16.0% ( 11% -  20%)
LowAnd6HighOr6LowNot          4.69   (2.9%)            5.45   (3.7%)    16.1% (  9% -  23%)
MedAnd6MedOr6LowNot           3.45   (1.2%)            4.04   (2.1%)    17.1% ( 13% -  20%)
LowAnd6MedOr6LowNot           8.77   (1.6%)           10.38   (2.4%)    18.4% ( 14% -  22%)
LowAnd6MedOr6MedNot           6.36   (2.6%)            7.55   (3.5%)    18.6% ( 12% -  25%)
LowAnd6MedOr6HighNot          5.48   (3.1%)            6.51   (3.9%)    18.8% ( 11% -  26%)
LowAnd6HighOr6MedNot          5.77   (3.1%)            6.86   (4.3%)    18.9
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: perf.png
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: merge.png, merge.perf, LUCENE-4396.patch

This is a patch based on git mirror commit a5a2e716ebcba1a201c4934f336ae9c0fcb551bf .

In this patch, I have fixed a bug that caused wrong coord counting. I have also come up with an awesome idea for choosing the scorer and implemented it in this patch. Here is the story behind the idea.

After completing the performance tests for each scorer, I plotted the results to get an intuitive view of how each scorer behaves. The figures are shown below.

!perf.png!

I then noticed that the performance of each scorer can probably be fitted by a straight line. A few points look like outliers, such as (8, -20) on BAS in the LowAndNLowOr case. However, when I retested BAS, its performance came out at 10 for N = 8, so I treat these outlying points as noise.

The next step is to get expressions for those performance curves. Looking at the perf. figures again, BAS gets the best performance on average, so we only discuss BAS here. The fitted line for each performance curve is shown below (the horizontal axis is 'x' and the vertical axis is 'y').

|| LowAndNLowOr || LowAndNHighOr || HighAndNLowOr || HighAndNHighOr ||
| y = 5.33x - 31 | y = 3.83x - 8.5 | y = 1.67x - 55 | y = 7.5x - 32.5 |

|| LowAndNLowNot || LowAndNHighNot || HighAndNLowNot || HighAndNHighNot ||
| y = 4.5x - 18.5 | y = 3x - 7 | y = -0.83x - 22.5 | y = 7x - 31 |

I got these expressions by visual estimation. You can get similar expressions by drawing a straight line between the first and last points of each figure.

Now suppose that the general performance expression is y = A \* x + B .
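The fitted lines in the two tables above can be evaluated directly, which makes the outlier argument easy to check. In the sketch below, the (A, B) pairs are copied from those tables; the names `BAS_FITS`, `predict` and `line_through` are hypothetical helpers, and the points passed to `line_through` in the example are made up, since the raw figure data is not in the message.

```python
# (A, B) for y = A * x + B, copied from the visually estimated fits for BAS.
BAS_FITS = {
    "LowAndNLowOr": (5.33, -31.0),
    "LowAndNHighOr": (3.83, -8.5),
    "HighAndNLowOr": (1.67, -55.0),
    "HighAndNHighOr": (7.5, -32.5),
    "LowAndNLowNot": (4.5, -18.5),
    "LowAndNHighNot": (3.0, -7.0),
    "HighAndNLowNot": (-0.83, -22.5),
    "HighAndNHighNot": (7.0, -31.0),
}


def predict(case, n):
    """Value of the fitted line for the given case at x = n."""
    a, b = BAS_FITS[case]
    return a * n + b


def line_through(p0, p1):
    """Slope and intercept of the straight line through two (x, y) points."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


fitted_at_8 = predict("LowAndNLowOr", 8)        # 5.33 * 8 - 31
slope, intercept = line_through((2, 1), (4, 5))  # hypothetical endpoints
```

For LowAndNLowOr at N = 8 the fitted line gives about 11.6, which is close to the retested value of 10 and far from the first measurement of -20, in line with treating that point as noise.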
In lucene/BooleanQuery, the only information we have is requiredCost and optionalCost (or prohibitedCost). For convenience, let's denote these two values as 'a' and 'b' respectively. If we can find two functions f and g such that A = f(a, b) and B = g(a, b), we can compute the performance curve in the program. I only discuss A = f(a, b) here; B is handled in a similar way. Likewise, I only discuss the \*Or cases; the \*Not cases are similar.

We know the values of a and b for each case.

|| LowAndNLowOr || LowAndNHighOr || HighAndNLowOr || HighAndNHighOr ||
| a = L, b = L | a = L, b = H | a = H, b = L | a = H, b = H |

Here, L represents a low cost and H represents a high cost. We can estimate these two values by running statistics over wikimedium.10M.nostopwords.tasks in luceneutil. The estimated values are:
{code}
H = 747310, L = 34750
{code}
As you can see, the values of H and L are very large, so we take their (natural) logarithms:
{code}
h = log(H), l = log(L)
{code}
Suppose that f has the form
{code}
f(a,b) = k1 * u1(a, b) + k2 * u2(a, b) + k3 * u3(a, b) + k4 * u4(a, b)
{code}
Thus, we have
{code}
k1 * u1(l, l) + k2 * u2(l, l) + k3 * u3(l, l) + k4 * u4(l, l) = 5.33
k1 * u1(l, h) + k2 * u2(l, h) + k3 * u3(l, h) + k4 * u4(l, h) = 3.84
k1 * u1(h, l) + k2 * u2(h, l) + k3 * u3(h, l) + k4 * u4(h, l) = 1.67
k1 * u1(h, h) + k2 * u2(h, h) + k3 * u3(h, h) + k4 * u4(h, h) = 7.5
{code}
The next question is how to choose ui(a, b). I have tried many formulations, and I found the following to work best.

||u1(a,b)||u2(a,b)||u3(a,b)||u4(a,b)||
|a |b | a\*b | a\*b/(a+b)|

I think this setup has a physical meaning. u1 and u2 are the influence factors of a and b respectively. u3 represents the higher-dimensional factor. u4 is half of the harmonic mean.
Thus, we have
{code}
[ l  l  l*l  l/2       ]   [ k1 ]   [5.33]
[ l  h  l*h  l*h/(l+h) ] * [ k2 ] = [3.84]
[ h  l  h*l  l*h/(l+h) ]   [ k3 ]   [1.67]
[ h  h  h*h  h/2       ]   [ k4 ]   [7.5 ]

that is

[ 10.4559  10.4559  109.3266  5.2280 ]
[ 10.4559  13.5242  141.4085  5.8969 ] * [k] = [A]
[ 13.5242  10.4559  141.4085  5.8969 ]
[ 13.5242  13.5242  182.9049  6.7621 ]

or symbolized

[U] * [k] = [A]
{code}
Luckily, \[U\] is invertible, so its inverse can be computed.
{code}
           [ -1.4365   1.1106   1.4365  -1.1106 ]
inv([U]) = [ -1.4365   1.4365   1.1106  -1.1106 ]
           [ -0.0312   0.       0.       0.0241 ]
           [  6.5893  -5.0943  -5.0943   3.9386 ]

[k] = inv([U]) * [A] = [-9.3396 -8.6334 0.0145 36.6630]'
{code}
Now, in the program, we can get A by
{code}
A = f(a,b) = k1 * a + k2 * b + k3 * a * b + k4 * a * b / (a + b)
{code}
and get B in a similar way. Finally, we get the estimated fitted line of BAS for a specific case. y
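The fitting procedure described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names (SlopeModelFit, basis, solve, fit, slope) are hypothetical and do not come from the patch, and the system is solved by Gaussian elimination rather than by forming inv(\[U\]) explicitly. The constants H, L and the four slopes are the ones quoted in this comment.

```java
import java.util.Arrays;

/** Sketch (hypothetical names): fit the slope model A = f(a, b) from four measured slopes. */
final class SlopeModelFit {

    /** Basis functions u1..u4 from the comment: a, b, a*b, a*b/(a+b). */
    static double[] basis(double a, double b) {
        return new double[] {a, b, a * b, a * b / (a + b)};
    }

    /** Solves m * x = rhs by Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] m, double[] rhs) {
        int n = rhs.length;
        double[][] aug = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(m[i], 0, aug[i], 0, n);
            aug[i][n] = rhs[i];
        }
        for (int col = 0; col < n; col++) {
            // Pick the row with the largest entry in this column as pivot.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(aug[r][col]) > Math.abs(aug[pivot][col])) pivot = r;
            }
            double[] tmp = aug[col];
            aug[col] = aug[pivot];
            aug[pivot] = tmp;
            // Eliminate the column below the pivot.
            for (int r = col + 1; r < n; r++) {
                double factor = aug[r][col] / aug[col][col];
                for (int c = col; c <= n; c++) aug[r][c] -= factor * aug[col][c];
            }
        }
        // Back substitution.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = aug[i][n];
            for (int c = i + 1; c < n; c++) sum -= aug[i][c] * x[c];
            x[i] = sum / aug[i][i];
        }
        return x;
    }

    /** Fits k1..k4 so that f reproduces the slopes of the (l,l), (l,h), (h,l), (h,h) cases. */
    static double[] fit(double l, double h, double[] slopes) {
        double[][] u = {basis(l, l), basis(l, h), basis(h, l), basis(h, h)};
        return solve(u, slopes);
    }

    /** Evaluates A = f(a, b) for log-costs a and b. */
    static double slope(double[] k, double a, double b) {
        double[] u = basis(a, b);
        double sum = 0;
        for (int i = 0; i < k.length; i++) sum += k[i] * u[i];
        return sum;
    }

    public static void main(String[] args) {
        double l = Math.log(34750), h = Math.log(747310);
        double[] k = fit(l, h, new double[] {5.33, 3.84, 1.67, 7.5});
        System.out.println(Arrays.toString(k));
        System.out.println(slope(k, l, h)); // reproduces the LowAndNHighOr slope up to rounding
    }
}
```

Because the model has four coefficients and four data points, the fit passes exactly through the measured slopes; for other (a, b) pairs it only interpolates, so its accuracy there is not established by this sketch.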
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086295#comment-14086295 ]

Da Huang commented on LUCENE-4396:

Thanks to Mike and Paul for the suggestions.

{quote}
It's a bit spooky that collectMore recurses on itself; in theory there's an adversary that could consume quite a bit of stack right? Can we refactor that to the equivalent while loop (it's just tail recursion).
{quote}
OK. Doing collectMore without recursion is easy.

{quote}
Unfortunately the logic for picking which scorer to use looks really complex; hopefully we can simplify it. Also, do we really need 3 scorer classes (BS, BAS, BLS) for the non-DAAT case? Ie, does each really provide a compelling situation where it's better than the others?
{quote}
Actually, these scorers are still very competitive when there are far fewer clauses. I did some tests today. The results are as follows.
{code}
Task                  array      bs      ll
HighAnd10HighNot       34.7    31.7    53.7*
HighAnd10HighOr        27.0+   -0.9    32.6*
HighAnd10LowNot       -33.5    -6.5   -17.5
HighAnd10LowOr        -36.7    -2.1   -43.4
HighAnd5HighNot         3.0   -10.0    16.8*
HighAnd5HighOr        -11.5    -2.0    -4.2
HighAnd5LowNot        -44.5    -9.4   -36.6
HighAnd5LowOr         -56.2    -3.4   -61.9
LowAnd10HighNot        18.2+   18.7+   20.0*
LowAnd10HighOr         21.2+   26.1*    6.0
LowAnd10LowNot         19.6*   -2.4     5.7
LowAnd10LowOr          13.3*   -2.7   -11.1
LowAnd5HighNot          9.1*    6.9+   -2.7
LowAnd5HighOr           7.6    12.5*   -9.3
LowAnd5LowNot          -0.9    -4.0   -11.0
LowAnd5LowOr           -7.5    -3.6   -27.2

Task                Good Method
HighAnd10HighNot    ll
HighAnd10HighOr     ll, array
HighAnd10LowNot
HighAnd10LowOr
HighAnd5HighNot     ll
HighAnd5HighOr
HighAnd5LowNot
HighAnd5LowOr
LowAnd10HighNot     ll, bs, array
LowAnd10HighOr      bs, array
LowAnd10LowNot      array
LowAnd10LowOr       array
LowAnd5HighNot      array, bs
LowAnd5HighOr       bs
LowAnd5LowNot
LowAnd5LowOr

Task                  array      bs      ll
HighAnd25HighNot       75.3    79.1   131.5*
HighAnd25HighOr        69.8+   74.2+   80.8*
HighAnd25LowNot        -1.0     3.8    15.7*
HighAnd25LowOr         -3.8   -34.7    -9.1
HighAnd5HighNot         8.1*   -2.9     7.8+
HighAnd5HighOr         -1.6    -4.1   -12.9
HighAnd5LowNot        -37.3   -33.3   -39.1
HighAnd5LowOr         -60.8   -42.5   -60.7
LowAnd25HighNot        38.9    40.1    79.4*
LowAnd25HighOr         44.8*   40.2+   23.5
LowAnd25LowNot         52.7+   55.7*   39.2+
LowAnd25LowOr          51.1*   50.8+   23.7
LowAnd5HighNot         10.0+   12.0*   -2.7
LowAnd5HighOr           5.0     8.0*   -9.9
LowAnd5LowNot           2.6     4.1*  -10.1
LowAnd5LowOr           -8.8    -5.1   -29.1

Task                Good Method
HighAnd25HighNot    ll
HighAnd25HighOr     ll, bs, array
HighAnd25LowNot     ll
HighAnd25LowOr
HighAnd5HighNot     array, ll
HighAnd5HighOr
HighAnd5LowNot
HighAnd5LowOr
LowAnd25HighNot     ll
LowAnd25HighOr      array, bs
LowAnd25LowNot      bs, array, ll
LowAnd25LowOr       array, bs
LowAnd5HighNot      bs, array
LowAnd5HighOr       bs
LowAnd5LowNot       bs
LowAnd5LowOr
{code}
Now, I'm only using BAS and BLS for cases with MUST, as BS's performance is not very competitive. Even though BS seems to be a compelling choice for the LowAnd5HighOr case, its advantage over BAS is not huge. Besides, BS would make the logic even more complicated, as BS is a BulkScorer while the others are Scorers.

If we still need to give up one scorer, I think it would be better to give up BLS, as BAS appears to bring more benefit than BLS.

{quote}
It's not great adding so much complexity for performance gains of unusual (so many clauses) boolean queries...
{quote}
I'm going to focus on the \*And5\* and \*And10\* cases when optimizing the perf. If 10 clauses are still too many, I will focus only on the \*And5\* cases.

Besides, today
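
The tail-recursion point raised in the first quote of this comment can be illustrated with a small Java sketch. The actual collectMore signature is not shown in this thread, so everything below (the class name, the WINDOW constant, the idea of advancing through doc IDs window by window) is a hypothetical stand-in that only demonstrates the mechanical rewrite from a tail call to a while loop.

```java
/** Sketch (hypothetical names): turning a tail-recursive "collect more" step into a loop. */
final class TailRecursionToLoop {

    /** Assumed window size; not taken from the patch. */
    static final int WINDOW = 2048;

    /** Recursive form: one stack frame per window, so deep inputs can exhaust the stack. */
    static int collectRecursive(int doc, int maxDoc, int collected) {
        if (doc >= maxDoc) {
            return collected;
        }
        int end = Math.min(doc + WINDOW, maxDoc);
        // Tail call: nothing is left to do in this frame after the recursive call returns.
        return collectRecursive(end, maxDoc, collected + (end - doc));
    }

    /** Loop form: the parameters of the tail call become loop variables; stack depth is constant. */
    static int collectIterative(int doc, int maxDoc) {
        int collected = 0;
        while (doc < maxDoc) {
            int end = Math.min(doc + WINDOW, maxDoc);
            collected += end - doc;
            doc = end;
        }
        return collected;
    }
}
```

Both forms return the same count for the same input; only the loop form is safe when the number of windows is large.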
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: tasks.cpp LUCENE-4396.patch And.tasks

This patch is based on git mirror commit 67d17eb81b754fa242bb91e1b91070fd8b38ecd9 .

In this patch, I removed the unused classes, encapsulated some functions and fixed some bugs.

Besides, the cases in the tasks file used before are strongly related to each other, which I think is not good. Therefore, I generated a new tasks file. And.tasks is the new tasks file, and tasks.cpp is the program that generates it. You can generate the tasks file by running
{code}
g++ tasks.cpp -std=c++0x -o tasks
./tasks wikimedium.10M.nostopwords.tasks And.tasks
{code}
The perf. on the new tasks file is as follows.
{code}
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAnd5LowNot               5.40   (5.1%)            4.88   (4.2%)      -9.6% ( -18% -    0%)
HighAnd5LowOr                7.05  (10.2%)            6.87   (3.8%)      -2.6% ( -15% -   12%)
LowAnd5LowNot               27.17   (2.1%)           26.47   (2.6%)      -2.6% (  -7% -    2%)
HighAnd5HighOr               1.13   (3.8%)            1.11   (2.2%)      -1.8% (  -7% -    4%)
LowAnd5LowOr                31.82   (2.6%)           31.35   (2.3%)      -1.5% (  -6% -    3%)
PKLookup                    98.80   (5.2%)          102.02   (6.3%)       3.3% (  -7% -   15%)
HighAnd5HighNot              1.95   (1.0%)            2.04   (2.1%)       4.7% (   1% -    7%)
LowAnd5HighNot               9.46   (2.9%)           10.32   (2.7%)       9.0% (   3% -   15%)
LowAnd5HighOr                7.56   (2.8%)            8.42   (2.8%)      11.4% (   5% -   17%)
LowAnd60HighOr               0.51   (2.5%)            0.82   (4.8%)      58.7% (  50% -   67%)
LowAnd60LowNot               2.61   (1.0%)            4.64   (3.4%)      78.0% (  72% -   83%)
HighAnd60LowNot              1.30   (1.2%)            2.36   (3.7%)      81.1% (  75% -   87%)
HighAnd60LowOr               1.18   (1.3%)            2.15   (3.7%)      82.0% (  76% -   88%)
LowAnd60LowOr                2.25   (0.6%)            4.61   (4.2%)     104.7% (  99% -  110%)
HighAnd60HighOr              0.10   (0.7%)            0.26   (4.8%)     151.2% ( 144% -  157%)
LowAnd60HighNot              0.53   (2.5%)            1.62   (8.0%)     204.0% ( 188% -  220%)
HighAnd60HighNot             0.14   (0.9%)            0.59   (8.9%)     328.4% ( 315% -  341%)
{code}
My next step is to do more tests to get better rules and to verify correctness. I think it can be finished by this Friday.
As the suggested pencil-down date is coming, I will then begin to scrub the code, improve the comments, and write the concluding documentation.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084332#comment-14084332 ]

Da Huang commented on LUCENE-4396:

Hi, [~paul.elsc...@xs4all.nl].

The commit hash mentioned here just indicates which commit the patch should be applied on. If you want to get the latest Java code discussed here, for example, you can do the following:
{code}
git clone https://github.com/apache/lucene-solr
cd lucene-solr
git checkout 67d17eb81b754fa242bb91e1b91070fd8b38ecd9
git apply LUCENE-4396.patch
{code}
LUCENE-4396.patch is attached to this page; download it first. Hope this helps.

By the way, there is a repo where I maintain the code, but it is on a server in my lab, so you cannot clone it without a password. Sorry about that.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 67d17eb81b754fa242bb91e1b91070fd8b38ecd9

This patch goes further than the last one. Firstly, I moved all the scorer-choosing logic to .bulkScorer(), so that there's no need to wrap a scorer in .bulkScorer(). Secondly, I have tried to use BooleanScorer for some cases with MUST. However, it seems there was something wrong with my earlier test on BS. BS beats DAAT in only 2 cases, and in those 2 cases it performs worse than the other explored scorers.

The perf. of BQ (the merged scorer) and BS is shown below.
{code}
BQ
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndTonsLowNot            5.01   (3.5%)            4.29   (2.7%)     -14.3% ( -19% -   -8%)
HighAndSomeLowNot           15.33   (5.1%)           13.71   (5.4%)     -10.6% ( -20% -    0%)
LowAndSomeLowOr            240.72   (2.5%)          217.73   (2.5%)      -9.6% ( -14% -   -4%)
LowAndSomeLowNot           269.51   (1.4%)          244.76   (2.3%)      -9.2% ( -12% -   -5%)
HighAndTonsLowOr             5.19   (5.3%)            4.94   (2.0%)      -4.8% ( -11% -    2%)
HighAndSomeHighNot           1.60   (2.0%)            1.57   (2.6%)      -1.9% (  -6% -    2%)
HighAndSomeLowOr             6.65  (11.5%)            6.77   (4.1%)       1.8% ( -12% -   19%)
PKLookup                    96.93   (2.3%)           99.72   (4.1%)       2.9% (  -3% -    9%)
LowAndSomeHighNot           59.45   (1.5%)           61.63   (2.4%)       3.7% (   0% -    7%)
LowAndSomeHighOr            40.78   (2.0%)           42.75   (3.0%)       4.8% (   0% -   10%)
HighAndSomeHighOr            2.11   (2.8%)            2.44   (3.0%)      16.1% (  10% -   22%)
LowAndTonsLowNot            17.45   (1.3%)           20.88   (2.5%)      19.6% (  15% -   23%)
LowAndTonsHighOr             2.76   (1.6%)            3.34   (3.1%)      21.0% (  16% -   26%)
LowAndTonsLowOr             15.36   (1.2%)           19.83   (3.1%)      29.2% (  24% -   33%)
HighAndTonsHighOr            0.08   (0.7%)            0.21   (5.1%)     159.8% ( 152% -  166%)
LowAndTonsHighNot            1.69   (1.5%)            5.14   (5.9%)     204.0% ( 193% -  214%)
HighAndTonsHighNot           0.09   (0.7%)            0.41  (11.0%)     359.9% ( 345% -  374%)

BooleanScorer
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
LowAndSomeHighOr            51.38   (1.7%)            1.47   (0.4%)     -97.1% ( -97% -  -96%)
LowAndTonsHighOr             2.79   (1.5%)            0.10   (0.5%)     -96.5% ( -97% -  -95%)
LowAndTonsHighNot            1.71   (2.0%)            0.17   (0.7%)     -90.3% ( -91% -  -89%)
LowAndSomeHighNot           32.69   (2.2%)            3.18   (0.6%)     -90.3% ( -91% -  -89%)
LowAndSomeLowOr            258.50   (1.7%)           91.84   (1.6%)     -64.5% ( -66% -  -62%)
HighAndSomeLowOr            12.66   (9.1%)            5.89   (2.3%)     -53.5% ( -59% -  -46%)
LowAndSomeLowNot           252.33   (2.1%)          124.57   (1.1%)     -50.6% ( -52% -  -48%)
HighAndTonsLowOr             3.13   (7.5%)            1.57   (2.3%)     -49.7% ( -55% -  -43%)
LowAndTonsLowOr             14.17   (0.8%)            7.32   (2.6%)     -48.4% ( -51% -  -45%)
HighAndSomeLowNot           18.01   (5.6%)           10.03   (2.8%)     -44.3% ( -49% -  -37%)
LowAndTonsLowNot            17.17   (1.1%)           11.33   (1.5%)     -34.0% ( -36% -  -31%)
HighAndTonsLowNot            6.29   (2.5%)            4.73   (2.4%)     -24.9% ( -29% -  -20%)
HighAndSomeHighOr            1.66   (3.1%)            1.28   (7.5%)     -22.7% ( -32% -  -12%)
HighAndSomeHighNot           2.11   (1.4%)            1.83   (3.4%)     -13.5% ( -18% -   -8%)
PKLookup                    96.92   (4.0%)           94.94   (2.5%)      -2.0% (  -8% -    4%)
HighAndTonsHighOr            0.07   (0.5%)            0.09  (18.2%)      38.3% (  19% -   57%)
HighAndTonsHighNot           0.04   (1.9%)            0.16  (24.4%)     263.0% ( 232% -  294%)
{code}
According to the BQ table, BQ performs poorly on the first 4 cases. However, when I run these cases one by one, they are within 2% of trunk. I'm not sure what causes this.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit ce7d0578b30981d15687bf76aec595274efccbad

I've tried to merge all the explored methods to get better performance for boolean retrieval. In this patch, I only mix methods in BooleanQuery.BooleanWeight.scorer().

I also tried to mix methods in .bulkScorer(), but that fails the ant test. It took me a lot of time to figure out the cause. It turned out that, in BooleanQuery.BooleanWeight.bulkScorer(), I'm not supposed to call w.bulkScorer() to get the optional scorer or the prohibited scorer; otherwise TestBooleanScorer.testEmbeddedBooleanScorer throws an UnsupportedOperationException because it calls an unimplemented .scorer() method.

The awkward part is that I cannot get the cost of a scorer without a Scorer instance. Therefore, my next step is to check whether I can get the optional scorer in .bulkScorer(). If yes, do the same thing as in .scorer(). If no, just use BooleanScorer.

Besides, I'm very sorry that the code in this patch may look ugly, as I haven't had enough time to clean it up.
{code}
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndTonsLowNot            4.06   (4.0%)            3.44   (5.1%)     -15.5% ( -23% -   -6%)
HighAndSomeLowNot           17.02   (5.3%)           15.61   (9.2%)      -8.3% ( -21% -    6%)
HighAndTonsLowOr             5.82   (5.0%)            5.67   (1.5%)      -2.6% (  -8% -    4%)
LowAndSomeHighOr            55.03   (3.0%)           54.39   (2.2%)      -1.2% (  -6% -    4%)
HighAndSomeHighNot           1.24   (2.3%)            1.23   (2.3%)      -1.0% (  -5% -    3%)
LowAndSomeLowOr            231.48   (1.8%)          229.47   (2.1%)      -0.9% (  -4% -    3%)
PKLookup                    97.60   (2.1%)           97.63   (2.2%)       0.0% (  -4% -    4%)
LowAndSomeLowNot           312.07   (2.0%)          312.28   (2.1%)       0.1% (  -3% -    4%)
HighAndSomeHighOr            1.69   (2.6%)            1.69   (1.2%)       0.4% (  -3% -    4%)
HighAndSomeLowOr            14.28  (11.7%)           14.81   (4.7%)       3.7% ( -11% -   22%)
LowAndSomeHighNot           34.74   (2.9%)           36.83   (2.6%)       6.0% (   0% -   11%)
LowAndTonsHighOr             2.34   (2.7%)            2.90   (3.2%)      24.3% (  17% -   30%)
LowAndTonsLowOr             18.88   (1.0%)           25.14   (3.0%)      33.2% (  28% -   37%)
LowAndTonsLowNot            15.78   (1.4%)           22.29   (2.0%)      41.2% (  37% -   45%)
HighAndTonsHighOr            0.06   (0.6%)            0.17   (5.8%)     179.9% ( 172% -  187%)
LowAndTonsHighNot            1.33   (2.4%)            4.29   (8.1%)     223.5% ( 207% -  239%)
HighAndTonsHighNot           0.06   (1.8%)            0.34  (17.3%)     495.0% ( 467% -  523%)
{code}
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072693#comment-14072693 ]

Da Huang edited comment on LUCENE-4396 at 7/24/14 8:53 AM:

This patch is based on the git mirror commit ce7d0578b30981d15687bf76aec595274efccbad .

This is the first attempt to merge the scorers, so that we can get better perf for boolean retrieval. I created a new class named BooleanMixedScorerDecider to choose the best scorer. The rules for choosing still need to be improved; I have been looking for an elegant way to define them.
{code}
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndSomeLowNot           11.53   (7.3%)           10.75  (10.1%)      -6.8% ( -22% -   11%)
HighAndTonsLowNot            4.87   (4.0%)            4.64   (6.0%)      -4.9% ( -14% -    5%)
LowAndSomeLowOr            306.20   (2.2%)          299.06   (2.8%)      -2.3% (  -7% -    2%)
HighAndSomeLowOr            13.67   (9.4%)           13.38   (2.7%)      -2.1% ( -13% -   11%)
HighAndTonsLowOr             4.04   (6.4%)            3.96   (1.9%)      -1.9% (  -9% -    6%)
LowAndSomeLowNot           215.18   (1.9%)          211.14   (2.2%)      -1.9% (  -5% -    2%)
PKLookup                    96.26   (2.3%)           94.56   (2.8%)      -1.8% (  -6% -    3%)
HighAndTonsHighNot           0.06   (2.3%)            0.06   (2.6%)      -1.0% (  -5% -    4%)
HighAndTonsHighOr            0.06   (0.6%)            0.06   (1.3%)       0.9% (   0% -    2%)
HighAndSomeHighNot           1.59   (2.2%)            1.62   (2.9%)       1.7% (  -3% -    6%)
LowAndSomeHighNot           66.33   (2.1%)           68.77   (2.1%)       3.7% (   0% -    8%)
LowAndSomeHighOr            53.75   (1.6%)           56.86   (2.1%)       5.8% (   1% -    9%)
LowAndTonsLowNot            14.00   (1.7%)           14.84   (1.5%)       6.1% (   2% -    9%)
HighAndSomeHighOr            2.39   (2.2%)            2.68   (3.5%)      12.4% (   6% -   18%)
LowAndTonsLowOr             17.69   (0.9%)           21.64   (1.7%)      22.3% (  19% -   25%)
LowAndTonsHighOr             1.83   (1.3%)            2.33   (2.4%)      27.2% (  23% -   31%)
LowAndTonsHighNot            1.15   (1.5%)            1.51   (3.1%)      30.9% (  25% -   36%)
{code}
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074083#comment-14074083 ]

Da Huang commented on LUCENE-4396:

{quote}
Do we really need a separate class to make the decision about which scorer to use? Seems like the added logic for when to use BNS is fairly small so we could just add it into BQ's scorer method instead?
{quote}
OK, I will move the decision logic back to BQ.

{quote}
For bulkScorer, should we ever return BooleanScorer even when there are required clauses? Or was that just commented out for temporary benchmarking so we'd wrap BNS? When there is a required clause, if BNS is never slower than BS, then instead of falling back to super.bulkScorer we could do the wrapping ourselves there? Just to make it clearer we are using BNS ... or maybe just put a comment saying so (replacing that TODO).
{quote}
BooleanScorer should be used for bulkScorer in some cases. Falling back to super.bulkScorer when there are required clauses is just a temporary strategy. See the following tables.
{code}
Task                ArrayNotDel      BS  BitSet      ll    llbs   size5   size8   size9
HighAndSomeHighNot          0.7    15.3*    7.4     8.9     2.0     6.6    10.0     3.4
HighAndSomeHighOr          13.3    24.5*    7.8     9.1    10.9    17.3+   18.3+   21.3+
HighAndSomeLowNot         -45.1   -53.9   -55.0   -57.3   -45.5   -47.8   -42.2   -41.5
HighAndSomeLowOr          -44.7   -55.4   -51.2   -58.1   -54.5   -47.9   -39.7   -44.9
HighAndTonsHighNot        475.7+  472.7+  507.0+  552.9+  627.9*  149.1   144.7   143.7
HighAndTonsHighOr         141.0+  135.4+  162.4+  153.4+  169.7*  154.0+  150.0+  149.1+
HighAndTonsLowNot         -49.9   -66.2   -46.8   -76.9   -30.3   -73.7   -28.6   -15.6
HighAndTonsLowOr          -22.4   -69.4   -30.2   -67.5   -41.9   -63.8   -24.4   -13.9
LowAndSomeHighNot           3.7    -2.6    -9.0    -7.3    -6.2     4.5+    6.2*    4.7+
LowAndSomeHighOr            1.5   -14.0   -15.5   -10.8   -12.0     6.8*    5.8+    6.6+
LowAndSomeLowNot          -26.4   -43.7   -56.5   -47.3   -43.7     3.7*   -2.3    -4.0
LowAndSomeLowOr           -23.2   -41.8   -60.5   -46.2   -43.4     2.2*   -2.3    -8.8
LowAndTonsHighNot         380.6+  171.5   118.4   248.3   381.8*   22.5    23.8    26.5
LowAndTonsHighOr           29.8*    5.2    -1.1    10.7     5.4    24.2+   27.5+   28.2+
LowAndTonsLowNot           28.9     9.1   -39.3     5.3     1.3    39.1+   47.2*   44.3+
LowAndTonsLowOr            30.9+    7.2   -38.1     0.5     9.0    29.9+   40.9*   38.1+

Task                Good Method
HighAndSomeHighNot  BS
HighAndSomeHighOr   BS, size9, size8, size5
HighAndSomeLowNot
HighAndSomeLowOr
HighAndTonsHighNot  llbs, ll, BitSet, ArrayNotDel, BS
HighAndTonsHighOr   llbs, BitSet, size5, ll, size8, size9, ArrayNotDel, BS
HighAndTonsLowNot
HighAndTonsLowOr
LowAndSomeHighNot   size8, size9, size5
LowAndSomeHighOr    size5, size9, size8
LowAndSomeLowNot    size5
LowAndSomeLowOr     size5
LowAndTonsHighNot   llbs, ArrayNotDel
LowAndTonsHighOr    ArrayNotDel, size9, size8, size5
LowAndTonsLowNot    size8, size9, size5
LowAndTonsLowOr     size8, size9, ArrayNotDel, size5
{code}
BS performs the best for the HighAndSomeHigh* cases.

{quote}
For the rules on when to use which scorer, it seems like we should take the .cost() of the sub-clauses into account somehow...
{quote}
I have already taken .cost() into account; see the rules in the decider.
{code} if (!required.isEmpty() optional.size() 3) { float times = (float) required.get(0).cost() / optional.get(0).cost(); if (times 1) return new BooleanNovelScorer(weight, disableCoord, minShouldMatch, required, optional, prohibited, maxCoord); } if (!required.isEmpty() prohibited.size() 3) { float times = (float) required.get(0).cost() / prohibited.get(0).cost(); if (times 1) return new BooleanNovelScorer(weight, disableCoord, minShouldMatch, required, optional, prohibited, maxCoord); } {code} Here, I just take the first scorer's cost into account, as it may cost a lot to iterate all
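
The decider rule quoted above lost its comparison operators in this archive, so the exact conditions cannot be recovered from the text. The sketch below shows only the general shape of such a rule: compare the cost of the first required clause against the cost of the first optional (or prohibited) clause. The class name, the enum, the two threshold parameters and, in particular, the direction of both comparisons are assumptions made for illustration; they are not taken from the patch.

```java
/** Sketch (hypothetical names): a cost-ratio rule of the kind described in this comment. */
final class ScorerChoiceSketch {

    enum Choice { NOVEL_SCORER, DEFAULT_SCORER }

    /**
     * @param requiredCosts   costs of the MUST clauses (only the first one is used)
     * @param otherCosts      costs of the SHOULD or MUST_NOT clauses (only the first one is used)
     * @param minOtherClauses assumed lower bound on the number of other clauses
     * @param maxCostRatio    assumed upper bound on required cost / other cost
     */
    static Choice choose(long[] requiredCosts, long[] otherCosts,
                         int minOtherClauses, float maxCostRatio) {
        if (requiredCosts.length == 0
                || otherCosts.length == 0
                || otherCosts.length < minOtherClauses
                || otherCosts[0] <= 0) {
            return Choice.DEFAULT_SCORER;
        }
        // Only the first clause of each list is inspected, so the check is O(1).
        float times = (float) requiredCosts[0] / otherCosts[0];
        // ASSUMPTION: the direction of this comparison is illustrative only.
        return times <= maxCostRatio ? Choice.NOVEL_SCORER : Choice.DEFAULT_SCORER;
    }
}
```

Looking at a single clause keeps the decision cheap, at the price of ignoring how the remaining clauses are distributed.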
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

This is the first attempt to merge the scorers, so that we can get better perf for boolean retrieval. I created a new class named BooleanMixedScorerDecider to choose the best scorer. The rules for choosing still need to be improved; I have been looking for an elegant way to define them.
{code}
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndSomeLowNot           11.53   (7.3%)           10.75  (10.1%)      -6.8% ( -22% -   11%)
HighAndTonsLowNot            4.87   (4.0%)            4.64   (6.0%)      -4.9% ( -14% -    5%)
LowAndSomeLowOr            306.20   (2.2%)          299.06   (2.8%)      -2.3% (  -7% -    2%)
HighAndSomeLowOr            13.67   (9.4%)           13.38   (2.7%)      -2.1% ( -13% -   11%)
HighAndTonsLowOr             4.04   (6.4%)            3.96   (1.9%)      -1.9% (  -9% -    6%)
LowAndSomeLowNot           215.18   (1.9%)          211.14   (2.2%)      -1.9% (  -5% -    2%)
PKLookup                    96.26   (2.3%)           94.56   (2.8%)      -1.8% (  -6% -    3%)
HighAndTonsHighNot           0.06   (2.3%)            0.06   (2.6%)      -1.0% (  -5% -    4%)
HighAndTonsHighOr            0.06   (0.6%)            0.06   (1.3%)       0.9% (   0% -    2%)
HighAndSomeHighNot           1.59   (2.2%)            1.62   (2.9%)       1.7% (  -3% -    6%)
LowAndSomeHighNot           66.33   (2.1%)           68.77   (2.1%)       3.7% (   0% -    8%)
LowAndSomeHighOr            53.75   (1.6%)           56.86   (2.1%)       5.8% (   1% -    9%)
LowAndTonsLowNot            14.00   (1.7%)           14.84   (1.5%)       6.1% (   2% -    9%)
HighAndSomeHighOr            2.39   (2.2%)            2.68   (3.5%)      12.4% (   6% -   18%)
LowAndTonsLowOr             17.69   (0.9%)           21.64   (1.7%)      22.3% (  19% -   25%)
LowAndTonsHighOr             1.83   (1.3%)            2.33   (2.4%)      27.2% (  23% -   31%)
LowAndTonsHighNot            1.15   (1.5%)            1.51   (3.1%)      30.9% (  25% -   36%)
{code}
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: all.perf stat.cpp

I have retested the previously explored methods and computed statistics on their performance. The file all.perf contains the raw perf. data, and stat.cpp computes the statistics from all.perf.
{code}
g++ -std=c++0x stat.cpp -o stat
./stat all.perf
{code}
The perf. statistics are shown below.
{code}
Task                ArrayNotDel      BS  BitSet      ll    llbs   size5   size8   size9
HighAndSomeHighNot          0.7    15.3*    7.4     8.9     2.0     6.6    10.0     3.4
HighAndSomeHighOr          13.3    24.5*    7.8     9.1    10.9    17.3+   18.3+   21.3+
HighAndSomeLowNot         -45.1   -53.9   -55.0   -57.3   -45.5   -47.8   -42.2   -41.5
HighAndSomeLowOr          -44.7   -55.4   -51.2   -58.1   -54.5   -47.9   -39.7   -44.9
HighAndTonsHighNot        475.7+  472.7+  507.0+  552.9+  627.9*  149.1   144.7   143.7
HighAndTonsHighOr         141.0+  135.4+  162.4+  153.4+  169.7*  154.0+  150.0+  149.1+
HighAndTonsLowNot         -49.9   -66.2   -46.8   -76.9   -30.3   -73.7   -28.6   -15.6
HighAndTonsLowOr          -22.4   -69.4   -30.2   -67.5   -41.9   -63.8   -24.4   -13.9
LowAndSomeHighNot           3.7    -2.6    -9.0    -7.3    -6.2     4.5+    6.2*    4.7+
LowAndSomeHighOr            1.5   -14.0   -15.5   -10.8   -12.0     6.8*    5.8+    6.6+
LowAndSomeLowNot          -26.4   -43.7   -56.5   -47.3   -43.7     3.7*   -2.3    -4.0
LowAndSomeLowOr           -23.2   -41.8   -60.5   -46.2   -43.4     2.2*   -2.3    -8.8
LowAndTonsHighNot         380.6+  171.5   118.4   248.3   381.8*   22.5    23.8    26.5
LowAndTonsHighOr           29.8*    5.2    -1.1    10.7     5.4    24.2+   27.5+   28.2+
LowAndTonsLowNot           28.9     9.1   -39.3     5.3     1.3    39.1+   47.2*   44.3+
LowAndTonsLowOr            30.9+    7.2   -38.1     0.5     9.0    29.9+   40.9*   38.1+

Task                Good Method
HighAndSomeHighNot  BS
HighAndSomeHighOr   BS, size9, size8, size5
HighAndSomeLowNot
HighAndSomeLowOr
HighAndTonsHighNot  llbs, ll, BitSet, ArrayNotDel, BS
HighAndTonsHighOr   llbs, BitSet, size5, ll, size8, size9, ArrayNotDel, BS
HighAndTonsLowNot
HighAndTonsLowOr
LowAndSomeHighNot   size8, size9, size5
LowAndSomeHighOr    size5, size9, size8
LowAndSomeLowNot    size5
LowAndSomeLowOr     size5
LowAndTonsHighNot   llbs, ArrayNotDel
LowAndTonsHighOr    ArrayNotDel, size9, size8, size5
LowAndTonsLowNot    size8, size9, size5
LowAndTonsLowOr     size8, size9, ArrayNotDel, size5
{code}
Here, 'll' is the linked-list docs method, and 'llbs' is the linked list with a bitset. The character '*' marks the best perf, while '+' marks results that are roughly as good as the best.

I have been merging these methods. I decided to move the scorer-choosing logic into a new class, but I ran into a bug, which I'm working on.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:
    Attachment: LUCENE-4396.patch

This patch is based on git mirror commit ce7d0578b30981d15687bf76aec595274efccbad.

In this patch, I just compact the array as I go through the MUST_NOT docs.

{code}
Task                 QPS baseline    StdDev  QPS my_version    StdDev   Pct diff
HighAndTonsLowNot            4.88    (3.5%)            2.44    (4.4%)     -49.9%  ( -55% -  -43%)
HighAndSomeLowNot            6.55    (6.1%)            3.60    (4.7%)     -45.1%  ( -52% -  -36%)
HighAndSomeLowOr             9.93   (12.9%)            5.49    (4.7%)     -44.7%  ( -55% -  -31%)
LowAndSomeLowNot           293.78    (2.3%)          216.29    (1.7%)     -26.4%  ( -29% -  -22%)
LowAndSomeLowOr            347.60    (1.8%)          266.94    (1.2%)     -23.2%  ( -25% -  -20%)
HighAndTonsLowOr             5.59    (5.7%)            4.34    (4.4%)     -22.4%  ( -30% -  -13%)
PKLookup                    97.38    (2.1%)           95.54    (2.9%)      -1.9%  (  -6% -    3%)
HighAndSomeHighNot           1.88    (2.2%)            1.89    (6.6%)       0.7%  (  -7% -    9%)
LowAndSomeHighOr            41.32    (2.9%)           41.92    (2.8%)       1.5%  (  -4% -    7%)
LowAndSomeHighNot           54.74    (2.4%)           56.73    (2.7%)       3.7%  (  -1% -    8%)
HighAndSomeHighOr            2.26    (2.7%)            2.56    (6.8%)      13.3%  (   3% -   23%)
LowAndTonsLowNot            17.18    (1.2%)           22.14    (2.4%)      28.9%  (  24% -   32%)
LowAndTonsHighOr             1.21    (2.7%)            1.57    (4.4%)      29.8%  (  22% -   37%)
LowAndTonsLowOr             17.22    (1.3%)           22.53    (2.4%)      30.9%  (  26% -   35%)
HighAndTonsHighOr            0.07    (1.2%)            0.16   (13.1%)     141.0%  ( 125% -  157%)
LowAndTonsHighNot            2.02    (2.4%)            9.70    (9.7%)     380.6%  ( 360% -  402%)
HighAndTonsHighNot           0.09    (1.2%)            0.50   (23.1%)     475.7%  ( 446% -  505%)
{code}

Besides, I am now working on combining all the explored methods to get better perf. To get more accurate numbers for each method, I am retesting some of the previous methods. It may take several days to get a combined method working.
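The compaction step can be illustrated with a minimal sketch. This is not the patch's code: the function name `compact_required`, the plain-list representation, and the assumption that the MUST_NOT docs arrive as one merged ascending list are all hypothetical, chosen only to show how surviving entries can be shifted forward during the scan so that no separate per-entry "valid" flag is needed.

```python
def compact_required(required, prohibited):
    """Remove prohibited doc IDs from `required` in place.

    Both arguments are ascending lists of doc IDs (hypothetical
    representation). Surviving entries are copied toward the front as
    the scan proceeds, then the unused tail is dropped.
    Returns the number of surviving docs.
    """
    write = 0  # next slot to fill with a surviving doc
    p = 0      # cursor into the MUST_NOT docs
    for read in range(len(required)):
        doc = required[read]
        # Move the MUST_NOT cursor up to the current required doc.
        while p < len(prohibited) and prohibited[p] < doc:
            p += 1
        if p < len(prohibited) and prohibited[p] == doc:
            continue  # excluded: do not copy it forward
        required[write] = doc
        write += 1
    del required[write:]
    return write
```

Because `write` never runs ahead of `read`, the copy is safe to do in the same array, which is the point of compacting "as you go".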
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063089#comment-14063089 ]

Da Huang commented on LUCENE-4396:

Thank you, Mike!

{quote}
It looks like this gave some nice gains with the many-not cases
{quote}
Yes, but the many-not cases may not be common. Therefore, this method might not be used in the final method.

{quote}
Curiously some of the tasks are really hurt by the larger sizes ... maybe 1<<9 is a good compromise?
{quote}
Yeah. In the end, I will just focus on the \*Some\* cases. size9 is better for the HighAndSomeHighOr case, while size5 is better for the LowAndSomeHighOr, LowAndSomeLowNot and LowAndSomeLowOr cases. I think it would be better to detect the case type and adjust the SIZE of bucketTable in BNS's constructor.
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063089#comment-14063089 ]

Da Huang edited comment on LUCENE-4396 at 7/16/14 3:53 AM:

Thank you, Mike!

{quote}
It looks like this gave some nice gains with the many-not cases
{quote}
Yes, but the many-not cases might not be common. Therefore, this method might not be used in the final method.

{quote}
Curiously some of the tasks are really hurt by the larger sizes ... maybe 1<<9 is a good compromise?
{quote}
Yeah. In the end, I will just focus on the \*Some\* cases. size9 is better for the HighAndSomeHighOr case, while size5 is better for the LowAndSomeHighOr, LowAndSomeLowNot and LowAndSomeLowOr cases. I think it would be better to detect the case type and adjust the SIZE of bucketTable in BNS's constructor.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:
    Attachment: SIZE.perf, stat.cpp

I have run tests with different values of SIZE for bucketTable.

The file 'SIZE.perf' contains the raw test results. 'stat.cpp' is a C++ program that computes statistics from *.perf files. You can compile it with 'g++ stat.cpp -std=c++0x -o stat' and run it with './stat SIZE.perf'.

The statistics for SIZE.perf should be as follows.

{code}
Task                size10  size11   size5   size6   size7   size8   size9
HighAndSomeHighNot   -14.5     4.0     6.6    -3.0     5.2    10.0*    3.4
HighAndSomeHighOr      2.4    10.9    17.3    17.4    12.9    18.3    21.3*
HighAndSomeLowNot    -36.8   -37.3   -47.8   -47.8   -40.2   -42.2   -41.5
HighAndSomeLowOr     -45.1   -46.4   -47.9   -46.2   -38.7   -39.7   -44.9
HighAndTonsHighNot   162.4*  145.1   149.1   130.1   142.9   144.7   143.7
HighAndTonsHighOr    154.8*  146.5   154.0   137.8   144.9   150.0   149.1
HighAndTonsLowNot    -27.0   -17.4   -73.7   -49.6   -40.1   -28.6   -15.6
HighAndTonsLowOr     -28.7   -14.3   -63.8   -44.8   -33.0   -24.4   -13.9
LowAndSomeHighNot      3.0     0.2     4.5     6.2*    5.7     6.2*    4.7
LowAndSomeHighOr       5.3     1.4     6.8*    6.7     7.7     5.8     6.6
LowAndSomeLowNot      -6.3   -24.4     3.7*    0.8     1.7    -2.3    -4.0
LowAndSomeLowOr      -10.3   -22.7     2.2*    2.0     1.7    -2.3    -8.8
LowAndTonsHighNot     27.3*   21.4    22.5    21.5    21.0    23.8    26.5
LowAndTonsHighOr      23.1    28.2    24.2    23.9    29.1*   27.5    28.2
LowAndTonsLowNot      33.0    46.5    39.1    33.4    30.0    47.2*   44.3
LowAndTonsLowOr       45.7*   34.6    29.9    36.8    45.3    40.9    38.1
{code}

size7 means the bucketTable's size is 1 << 7.

It seems that we can get a better result on the *SOME* tasks if we combine size9 with size5.
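To make the "combine size9 with size5" remark concrete, the following sketch takes the size5 and size9 columns for the \*Some\* tasks from the SIZE.perf statistics above and picks, per task, whichever size has the higher Pct diff. The dictionary layout and the name `best_of` are assumptions for illustration only; this is a reading aid for the table, not a selection heuristic from any patch.

```python
# Pct diff values for the *Some* tasks, copied from the SIZE.perf statistics.
SOME_TASKS = {
    "HighAndSomeHighNot": {"size5": 6.6, "size9": 3.4},
    "HighAndSomeHighOr": {"size5": 17.3, "size9": 21.3},
    "HighAndSomeLowNot": {"size5": -47.8, "size9": -41.5},
    "HighAndSomeLowOr": {"size5": -47.9, "size9": -44.9},
    "LowAndSomeHighNot": {"size5": 4.5, "size9": 4.7},
    "LowAndSomeHighOr": {"size5": 6.8, "size9": 6.6},
    "LowAndSomeLowNot": {"size5": 3.7, "size9": -4.0},
    "LowAndSomeLowOr": {"size5": 2.2, "size9": -8.8},
}


def best_of(sizes_by_task):
    """Return, for each task, the size whose Pct diff is higher."""
    return {
        task: max(sizes, key=sizes.get)
        for task, sizes in sizes_by_task.items()
    }


if __name__ == "__main__":
    for task, size in best_of(SOME_TASKS).items():
        print(f"{task:<20}{size}")
```

Picking the better of the two per task keeps the small gains on the LowAndSome\* tasks from size5 while retaining the larger gain on HighAndSomeHighOr from size9; the HighAndSomeLow\* tasks stay well below the baseline with either size.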
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060615#comment-14060615 ]

Da Huang edited comment on LUCENE-4396 at 7/14/14 1:04 PM:

I have run tests with different values of SIZE for bucketTable.

The file 'SIZE.perf' contains the raw test results. 'stat.cpp' is a C++ program that computes statistics from *.perf files. You can compile it with 'g++ stat.cpp -std=c++0x -o stat' and run it with './stat SIZE.perf'.

The statistics for SIZE.perf should be as follows.

{code}
Task                size10  size11   size5   size6   size7   size8   size9
HighAndSomeHighNot   -14.5     4.0     6.6    -3.0     5.2    10.0*    3.4
HighAndSomeHighOr      2.4    10.9    17.3    17.4    12.9    18.3    21.3*
HighAndSomeLowNot    -36.8   -37.3   -47.8   -47.8   -40.2   -42.2   -41.5
HighAndSomeLowOr     -45.1   -46.4   -47.9   -46.2   -38.7   -39.7   -44.9
HighAndTonsHighNot   162.4*  145.1   149.1   130.1   142.9   144.7   143.7
HighAndTonsHighOr    154.8*  146.5   154.0   137.8   144.9   150.0   149.1
HighAndTonsLowNot    -27.0   -17.4   -73.7   -49.6   -40.1   -28.6   -15.6
HighAndTonsLowOr     -28.7   -14.3   -63.8   -44.8   -33.0   -24.4   -13.9
LowAndSomeHighNot      3.0     0.2     4.5     6.2*    5.7     6.2*    4.7
LowAndSomeHighOr       5.3     1.4     6.8*    6.7     7.7     5.8     6.6
LowAndSomeLowNot      -6.3   -24.4     3.7*    0.8     1.7    -2.3    -4.0
LowAndSomeLowOr      -10.3   -22.7     2.2*    2.0     1.7    -2.3    -8.8
LowAndTonsHighNot     27.3*   21.4    22.5    21.5    21.0    23.8    26.5
LowAndTonsHighOr      23.1    28.2    24.2    23.9    29.1*   27.5    28.2
LowAndTonsLowNot      33.0    46.5    39.1    33.4    30.0    47.2*   44.3
LowAndTonsLowOr       45.7*   34.6    29.9    36.8    45.3    40.9    38.1
{code}

size7 means the bucketTable's size is 1 << 7.

It seems that we can get a better result on the \*SOME\* tasks if we combine size9 with size5.
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060615#comment-14060615 ]

Da Huang edited comment on LUCENE-4396 at 7/14/14 1:06 PM:

I have run tests with different values of SIZE for bucketTable.

The file 'SIZE.perf' contains the raw test results. 'stat.cpp' is a C++ program that computes statistics from *.perf files. You can compile it with 'g++ stat.cpp -std=c++0x -o stat' and run it with './stat SIZE.perf'.

The statistics for SIZE.perf should be as follows.

{code}
Task                size10  size11   size5   size6   size7   size8   size9
HighAndSomeHighNot   -14.5     4.0     6.6    -3.0     5.2    10.0*    3.4
HighAndSomeHighOr      2.4    10.9    17.3    17.4    12.9    18.3    21.3*
HighAndSomeLowNot    -36.8   -37.3   -47.8   -47.8   -40.2   -42.2   -41.5
HighAndSomeLowOr     -45.1   -46.4   -47.9   -46.2   -38.7   -39.7   -44.9
HighAndTonsHighNot   162.4*  145.1   149.1   130.1   142.9   144.7   143.7
HighAndTonsHighOr    154.8*  146.5   154.0   137.8   144.9   150.0   149.1
HighAndTonsLowNot    -27.0   -17.4   -73.7   -49.6   -40.1   -28.6   -15.6
HighAndTonsLowOr     -28.7   -14.3   -63.8   -44.8   -33.0   -24.4   -13.9
LowAndSomeHighNot      3.0     0.2     4.5     6.2*    5.7     6.2*    4.7
LowAndSomeHighOr       5.3     1.4     6.8*    6.7     7.7     5.8     6.6
LowAndSomeLowNot      -6.3   -24.4     3.7*    0.8     1.7    -2.3    -4.0
LowAndSomeLowOr      -10.3   -22.7     2.2*    2.0     1.7    -2.3    -8.8
LowAndTonsHighNot     27.3*   21.4    22.5    21.5    21.0    23.8    26.5
LowAndTonsHighOr      23.1    28.2    24.2    23.9    29.1*   27.5    28.2
LowAndTonsLowNot      33.0    46.5    39.1    33.4    30.0    47.2*   44.3
LowAndTonsLowOr       45.7*   34.6    29.9    36.8    45.3    40.9    38.1
{code}

size7 means the bucketTable's size is 1 << 7. The character '*', which was added manually, marks the best value.

It seems that we can get a better result on the \*Some\* tasks if we combine size9 with size5.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055653#comment-14055653 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions, Mike.

{quote}
Maybe try testing different values of SIZE?
{quote}
Hmm, that's a good idea.

{quote}
When you fold in the MUST_NOT clauses you could just compact the array as you go, instead of having a separate valid bool?
{quote}
Oh, that's a great idea! I will do that in the next patch.

{quote}
I think we should start moving this towards something committable? I.e., of all the approaches you've explored, let's take the most promising and fold them into a new scorer, and then work on the logic/heuristics for when this scorer should and shouldn't apply?
{quote}
Yeah, I agree. I am working on that.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:
    Attachment: LUCENE-4396.patch

This is a patch based on the git mirror commit 7f66461aea7bc2cb6f31a993cba77734e5e0f9d9.

In this patch, I treat the bucketTable as an array rather than a hash table. Its perf seems to be better than that of the former patches in most cases.

As you know, after putting the required docs into bucketTable, I have to scan both the table and the optional docs. Here, I have tried skipping ahead while scanning the bucketTable to improve the perf. The results are as follows.

{code}
No skip
Task                 QPS baseline    StdDev  QPS my_version    StdDev   Pct diff
HighAndTonsLowNot            6.56    (3.1%)            2.59    (1.0%)     -60.5%  ( -62% -  -58%)
HighAndTonsLowOr             6.43    (3.3%)            2.58    (0.8%)     -59.9%  ( -61% -  -57%)
HighAndSomeLowOr             8.49    (8.5%)            4.05    (1.8%)     -52.3%  ( -57% -  -45%)
HighAndSomeLowNot            6.17    (8.6%)            3.16    (2.1%)     -48.8%  ( -54% -  -41%)
LowAndSomeLowOr            250.58    (2.0%)          194.86    (1.6%)     -22.2%  ( -25% -  -18%)
LowAndSomeLowNot           178.66    (1.6%)          147.67    (2.2%)     -17.3%  ( -20% -  -13%)
LowAndSomeHighOr            40.71    (2.8%)           41.50    (1.8%)       2.0%  (  -2% -    6%)
PKLookup                    97.59    (3.0%)           99.52    (4.6%)       2.0%  (  -5% -    9%)
LowAndSomeHighNot           20.76    (3.0%)           21.54    (2.3%)       3.7%  (  -1% -    9%)
HighAndSomeHighNot           2.22    (1.7%)            2.67    (4.4%)      20.3%  (  13% -   26%)
LowAndTonsHighNot            3.81    (2.3%)            4.60    (2.1%)      20.8%  (  15% -   25%)
LowAndTonsHighOr             2.87    (2.3%)            3.48    (2.6%)      21.2%  (  15% -   26%)
HighAndSomeHighOr            1.74    (2.1%)            2.16    (3.5%)      24.0%  (  18% -   30%)
LowAndTonsLowOr             18.66    (1.3%)           23.68    (1.9%)      26.9%  (  23% -   30%)
LowAndTonsLowNot            16.01    (1.4%)           22.16    (2.8%)      38.4%  (  33% -   43%)
HighAndTonsHighOr            0.04    (0.9%)            0.11    (9.8%)     158.2%  ( 146% -  170%)
HighAndTonsHighNot           0.06    (1.1%)            0.15   (13.5%)     166.2%  ( 149% -  182%)
---
Binary search skip
Task                 QPS baseline    StdDev  QPS my_version    StdDev   Pct diff
HighAndTonsLowNot            6.22    (3.8%)            2.45    (0.9%)     -60.6%  ( -62% -  -58%)
HighAndSomeLowOr             8.29   (11.2%)            4.40    (3.0%)     -46.9%  ( -54% -  -36%)
HighAndSomeLowNot           12.34    (7.1%)            6.65    (2.6%)     -46.1%  ( -52% -  -39%)
LowAndSomeLowOr            232.38    (2.9%)          165.05    (1.8%)     -29.0%  ( -32% -  -24%)
HighAndTonsLowOr             5.17    (6.2%)            3.75    (3.0%)     -27.4%  ( -34% -  -19%)
LowAndSomeLowNot           227.71    (2.6%)          171.13    (3.2%)     -24.8%  ( -29% -  -19%)
HighAndSomeHighOr            1.35    (3.9%)            1.14    (3.5%)     -16.1%  ( -22% -   -9%)
LowAndSomeHighOr            50.17    (3.6%)           48.84    (3.7%)      -2.7%  (  -9% -    4%)
LowAndSomeHighNot           52.71    (3.0%)           51.55    (3.8%)      -2.2%  (  -8% -    4%)
PKLookup                    90.17    (3.5%)           91.38    (3.3%)       1.3%  (  -5% -    8%)
HighAndSomeHighNot           1.69    (2.9%)            2.00    (6.3%)      18.5%  (   8% -   28%)
LowAndTonsLowOr             15.61    (1.9%)           18.59    (2.8%)      19.0%  (  14% -   24%)
LowAndTonsHighOr             1.82    (2.7%)            2.20    (4.6%)      20.7%  (  13% -   28%)
LowAndTonsLowNot            15.51    (1.7%)           20.14    (3.8%)      29.8%  (  23% -   35%)
LowAndTonsHighNot            1.01    (2.9%)            1.34    (6.5%)      31.7%  (  21% -   42%)
HighAndTonsHighOr            0.07    (0.9%)            0.12    (6.9%)      77.7%  (  69% -   86%)
HighAndTonsHighNot           0.07    (1.4%)            0.19   (11.9%)     162.4%  ( 146% -  178%)
---
8 steps skip
Task                 QPS baseline    StdDev  QPS my_version    StdDev   Pct diff
HighAndTonsLowNot            5.45    (3.3%)            1.69    (1.3%)     -69.0%  ( -71% -  -66%)
HighAndSomeLowOr             5.46   (11.0%)            2.76    (4.4%)     -49.5%  ( -58% -  -38%)
HighAndSomeLowNot           17.94    (5.7%)           10.40    (3.8%)     -42.1%  ( -48% -  -34%)
LowAndSomeLowOr            306.62    (1.7%)          231.45    (1.5%)     -24.5%  ( -27% -  -21%)
LowAndSomeLowNot           286.30    (1.7%)          218.13    (2.0%)     -23.8%  ( -27% -  -20%)
HighAndTonsLowOr             6.34
{code}
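The message does not show how the patch skips within the bucketTable, so the following is only a sketch of what a "binary search skip" could look like once the required docs sit in a dense, ascending array: for each optional doc, binary-search forward from the current position instead of stepping slot by slot. The function name, the parallel `scores` list, and the single constant `optional_score` are hypothetical simplifications.

```python
from bisect import bisect_left


def add_optional_scores(required_docs, scores, optional_docs, optional_score):
    """Add `optional_score` to each required doc that is also optional.

    `required_docs` and `optional_docs` are ascending lists of doc IDs;
    `scores` is parallel to `required_docs`. Returns the match count.
    """
    pos = 0
    matches = 0
    for doc in optional_docs:
        # Binary-search skip: jump straight to the first slot >= doc,
        # searching only the part of the array not yet passed.
        pos = bisect_left(required_docs, doc, pos)
        if pos == len(required_docs):
            break  # no required docs left to match
        if required_docs[pos] == doc:
            scores[pos] += optional_score
            matches += 1
    return matches
```

A plain linear scan ("no skip") would advance `pos` one slot at a time instead of calling `bisect_left`; which variant wins depends on how far apart consecutive optional docs land in the array, which may be why the measured results above are mixed.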
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039754#comment-14039754 ]

Da Huang commented on LUCENE-4396:

{quote}
Looks like you separated required & optional scores in the non-DAAT impls and then carefully cast to float at the right times?
{quote}
Yes, that is what I meant.

{quote}
you can remove that TODO in ConjunctionScorer on switching sum to double?
{quote}
OK, I will do that in the next patch.

{quote}
So BooleanScorerIO is just like BooleanNovelScorer, except it uses a bitset instead of linked list to track the set buckets? Between BNS and BSIO which one is faster?
{quote}
Yes, exactly. According to the perf tests, BNS seems to do better on the tasks that are faster than trunk, while BSIO does better on the tasks that are slower than trunk.

{quote}
Why does BSIO/NS see massive gains on the tasks that have so many NOT clauses? I think in trunk/4.x today, we are not scoring the NOT clauses, right? While these gains are sizable, I think it's not a common use case...
{quote}
The reason is that when we search for +a -b -c -d, Lucene actually executes +a -(b c d), and computing the disjunction of (b c d) is expensive. Indeed, such a case may not be common.

{quote}
I think you've explored a number of options here and now we need to see if we can make this committable, e.g. figure out how to have BooleanQuery pick the right scorer for the situation? Somehow we need logic that looks at how many / cost of the sub-clauses and picks the right scorer?
{quote}
Yeah, you're right.

Besides, a new idea has occurred to me. BNS does not actually make use of the hash feature of BucketTable. Thus, I think we should not treat BucketTable as a hash table (i.e. do not place each doc at the absolute slot buckets[doc & MASK]). First, we put 2K required docs into the BucketTable. Then, we do TAAT on these 2K docs.
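The difference between the two placements can be sketched as follows. This is an illustration under stated assumptions, not code from any patch: `fill_hashed` and `fill_dense` are hypothetical names, SIZE is taken as 2K from the comment above, and doc IDs are held in a plain ascending list.

```python
SIZE = 1 << 11   # 2K slots, matching the "2K required docs" above
MASK = SIZE - 1


def fill_hashed(required_docs, window_start):
    """Hash-style placement: slot = doc & MASK.

    One pass can only hold docs whose IDs fall inside the window
    [window_start, window_start + SIZE); `window_start` is assumed
    to be a multiple of SIZE.
    """
    table = [None] * SIZE
    for doc in required_docs:
        if window_start <= doc < window_start + SIZE:
            table[doc & MASK] = doc
    return table


def fill_dense(required_docs, start, limit=SIZE):
    """Array-style placement: take the next `limit` required docs
    back to back, regardless of how far apart their IDs are."""
    return required_docs[start:start + limit]


if __name__ == "__main__":
    docs = list(range(0, 1_000_000, 1000))   # a sparse MUST clause
    hashed = fill_hashed(docs, 0)
    print(sum(slot is not None for slot in hashed), "docs in first hashed window")
    print(len(fill_dense(docs, 0)), "docs in first dense batch")
```

With a sparse required clause the hashed window is almost empty on every pass, while the dense batch is always full, so the term-at-a-time pass over the optional clauses is amortized over many more required docs.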
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039754#comment-14039754 ]

Da Huang edited comment on LUCENE-4396 at 6/21/14 10:00 AM:

{quote}
Looks like you separated required & optional scores in the non-DAAT impls and then carefully cast to float at the right times?
{quote}
Yes, that is what I meant.

{quote}
you can remove that TODO in ConjunctionScorer on switching sum to double?
{quote}
OK, I will do that in the next patch.

{quote}
So BooleanScorerIO is just like BooleanNovelScorer, except it uses a bitset instead of linked list to track the set buckets? Between BNS and BSIO which one is faster?
{quote}
Yes, exactly. According to the perf tests, BNS seems to do better on the tasks that are faster than trunk, while BSIO does better on the tasks that are slower than trunk.

{quote}
Why does BSIO/NS see massive gains on the tasks that have so many NOT clauses? I think in trunk/4.x today, we are not scoring the NOT clauses, right? While these gains are sizable, I think it's not a common use case...
{quote}
The reason is that when we search for +a -b -c -d, Lucene actually executes +a -(b c d), and computing the disjunction of (b c d) is expensive. Indeed, such a case may not be common.

{quote}
I think you've explored a number of options here and now we need to see if we can make this committable, e.g. figure out how to have BooleanQuery pick the right scorer for the situation? Somehow we need logic that looks at how many / cost of the sub-clauses and picks the right scorer?
{quote}
Yeah, you're right.

Besides, a new idea has occurred to me. BNS does not actually make use of the hash feature of BucketTable. Thus, I think we should not treat BucketTable as a hash table (i.e. do not place each doc at the absolute slot buckets[doc & MASK]). First, we put 2K required docs into the BucketTable. Then, we do TAAT on these 2K docs.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:
    Attachment: LUCENE-4396.patch

This is a patch based on git mirror commit 8f9b823db1d6fba2cc7ec61b0596970f3c8bbe85. The following things are done in this patch.

1. Completely solve the score difference between pure DAAT (i.e. BS2; as BS2 no longer exists, I think it is better to call it pure DAAT) and BS.
2. Add a new Scorer named BooleanScorerInOrder (BSIO), which uses only a bitset, not a linked list, to collect docs. I created this new Scorer rather than changing the old BS, because I think BS may be more useful in some cases. For now, BSIO does not support queries without any MUST clause, because the procedure for such queries is totally different from that for queries with a MUST clause.

The perf. of BSIO is as follows.

{code}
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
LowAndSomeLowOr            259.82   (2.3%)          102.70   (2.8%)   -60.5% ( -64% - -56%)
LowAndSomeLowNot           184.38   (2.8%)           80.26   (2.3%)   -56.5% ( -59% - -52%)
HighAndSomeLowNot           10.44   (7.2%)            4.70   (4.3%)   -55.0% ( -61% - -46%)
HighAndSomeLowOr            18.11   (8.0%)            8.83   (4.0%)   -51.2% ( -58% - -42%)
HighAndTonsLowNot            3.03   (5.4%)            1.62   (4.7%)   -46.8% ( -53% - -38%)
LowAndTonsLowNot            14.59   (1.2%)            8.86   (2.0%)   -39.3% ( -41% - -36%)
LowAndTonsLowOr             14.11   (1.1%)            8.74   (3.0%)   -38.1% ( -41% - -34%)
HighAndTonsLowOr             5.52   (4.3%)            3.85   (5.2%)   -30.2% ( -38% - -21%)
LowAndSomeHighOr            24.97   (3.5%)           21.10   (3.2%)   -15.5% ( -21% -  -9%)
LowAndSomeHighNot           25.51   (3.3%)           23.22   (3.4%)    -9.0% ( -15% -  -2%)
LowAndTonsHighOr             1.66   (2.6%)            1.64   (2.8%)    -1.1% (  -6% -   4%)
PKLookup                    95.22   (5.5%)           96.64   (6.1%)     1.5% (  -9% -  13%)
HighAndSomeHighNot           2.37   (2.0%)            2.55   (6.9%)     7.4% (  -1% -  16%)
HighAndSomeHighOr            2.25   (2.7%)            2.43   (6.0%)     7.8% (   0% -  16%)
LowAndTonsHighNot            2.72   (2.3%)            5.94   (5.8%)   118.4% ( 107% - 129%)
HighAndTonsHighOr            0.05   (0.8%)            0.12  (17.0%)   162.4% ( 143% - 181%)
HighAndTonsHighNot           0.08   (1.3%)            0.48  (23.4%)   507.0% ( 476% - 538%)
{code}
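To illustrate why a bitset alone can keep collected docs in order, here is a minimal sketch. It is not the BSIO code from the patch: the class and method names are hypothetical, and `java.util.BitSet` is used as a stand-in for whatever bitset the patch actually uses.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitsetBucketOrderSketch {
    static final int SIZE = 1 << 11;   // buckets per window (assumed 2048)
    static final int MASK = SIZE - 1;

    // Marks the slot of every doc seen in one window, in whatever order the
    // clauses deliver them, then reads the slots back in increasing order.
    static List<Integer> collectInOrder(int windowBase, int[] docsInArrivalOrder) {
        BitSet valid = new BitSet(SIZE);
        for (int doc : docsInArrivalOrder) {
            valid.set(doc & MASK);
        }
        List<Integer> ordered = new ArrayList<>();
        for (int slot = valid.nextSetBit(0); slot >= 0; slot = valid.nextSetBit(slot + 1)) {
            ordered.add(windowBase + slot);
        }
        return ordered;
    }
}
```

A linked list of set buckets would return the docs in arrival order instead, which is why a scorer built on it cannot hand docs to the collector in doc ID order without extra work.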
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025945#comment-14025945 ]

Da Huang commented on LUCENE-4396:

Hmm. What I mean is that ConjunctionScorer does not use a PQ, so using it can be faster than enumerating all the matching docs for each MUST clause.

As for .advance, I'm not sure whether its cost exceeds that of .next by enough that using .advance would be slower than using .next in this case.
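For readers unfamiliar with the trade-off between .next and .advance, the following sketch mimics the two calls on a sorted array of doc IDs. It is an assumption-laden toy: `PostingsCursor` is a hypothetical class, and the binary search only stands in for the skip data a real postings iterator would use.

```java
import java.util.Arrays;

class PostingsCursor {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;   // sorted doc IDs of one clause
    private int pos = -1;       // -1 means "not positioned yet"

    PostingsCursor(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Step to the next doc: cheap per call, but one call per posting.
    int nextDoc() {
        if (pos < docs.length) pos++;
        return docID();
    }

    // Jump to the first doc after the current one whose ID is >= target.
    int advance(int target) {
        int from = Math.min(pos + 1, docs.length);
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        pos = idx >= 0 ? idx : -idx - 1;
        return docID();
    }
}
```

When a MUST clause jumps far ahead, one `advance` on each SHOULD clause replaces many `nextDoc` calls; when the jump is short, the extra work inside `advance` may not pay off, which is the uncertainty raised in the comment above.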
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021505#comment-14021505 ]

Da Huang commented on LUCENE-4396:

{quote}True, but maybe in such cases (low freqs for the clauses) we should just use BS2. I think BS/BNS do better for high-freq clauses?{quote}
I'm sorry, I cannot be sure yet whether that is true, as I have not analyzed the perf results closely. The perf of BS/BNS depends on many factors, such as the freq of each clause and the number of SHOULD (and MUST_NOT) clauses.

{quote}I think we may get better performance when the MUST clauses are high freq, if we just use BooleanScorer to enumerate all the matching docs for each MUST instead of going through ConjunctionScorer?{quote}
I'm afraid that enumerating all the matching docs would not give better perf. BS2 and ConjunctionScorer collect docs with the document-at-a-time (DAAT) method, while BS/BNS is something like a combination of DAAT and term-at-a-time (TAAT). For conjunctive clauses, DAAT is more efficient than TAAT, as DAAT scans fewer docs.
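The DAAT versus TAAT distinction for conjunctions can be sketched as below. This is a simplified model, not Lucene's ConjunctionScorer or BooleanScorer: postings are plain sorted `int[]` arrays, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DaatVsTaatSketch {
    // First index >= from whose doc ID is >= target (docs.length if none).
    static int advance(int[] docs, int from, int target) {
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    // DAAT: keep one candidate doc and leapfrog every list up to it, so most
    // non-matching postings are skipped without being visited.
    static List<Integer> daat(int[][] postings) {
        List<Integer> hits = new ArrayList<>();
        if (postings.length == 0) return hits;
        int[] pos = new int[postings.length];
        int candidate = 0;
        outer:
        while (true) {
            for (int i = 0; i < postings.length; i++) {
                int idx = advance(postings[i], pos[i], candidate);
                if (idx == postings[i].length) return hits;   // one list is exhausted
                pos[i] = idx;
                if (postings[i][idx] > candidate) {
                    candidate = postings[i][idx];             // raise the candidate, start over
                    continue outer;
                }
            }
            hits.add(candidate);                              // every list sits on the candidate
            candidate++;
        }
    }

    // TAAT: walk every posting of every clause and count matches per doc.
    static List<Integer> taat(int[][] postings, int maxDoc) {
        List<Integer> hits = new ArrayList<>();
        if (postings.length == 0) return hits;
        int[] counts = new int[maxDoc];
        for (int[] list : postings) {
            for (int doc : list) counts[doc]++;
        }
        for (int doc = 0; doc < maxDoc; doc++) {
            if (counts[doc] == postings.length) hits.add(doc);
        }
        return hits;
    }
}
```

Both methods return the same hits; the difference is that `taat` touches every posting of every clause, while `daat` can jump over postings that cannot match.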
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017708#comment-14017708 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions, Mike!

{quote} When you say BNS (without bitset) vs. BS2 that means baseline=BS2 and my_version=BNS (without bitset)? {quote}
Yes, that is exactly what I mean.

{quote} With the added bitset, couldn't you not use a linked list anymore? Ie, just use prev/nextSetBit. I wonder if the bitset (instead of the linked list) could also help BooleanScorer? Maybe test this change separately (e.g. just modify BS we have today on trunk) to see if it helps or hurts... if it does help, it seems like BNS could be used (or BS could be a Scorer not a BulkScorer) even when there are no MUST clauses? Ie, the bitset lets us easily keep the order. Then we can merge BS/BNS into one? {quote}
Oh, that's a good idea! I will try that. However, the linked list can be helpful when the required docs are extremely sparse.

{quote} Could you attach all new tasks as a single file in general? Note that when you set up a luceneutil test, you can add a task filter using addTaskPattern, so you run just a subset of the tasks for that one test. {quote}
Do you mean merging And.tasks and AndOr.tasks? If so, there's no need to do that, because And.tasks contains all the tasks in AndOr.tasks, although the task names have changed. Anyway, thanks for the advice on using addTaskPattern; I had not noticed it.

{quote} Strange that the scores are still different between BS/BS2 and BNS/BS2 when using double. {quote}
I don't think it is strange, because the difference is due to the order in which the scores are summed. Suppose that a doc hits +a b c. Then SCORE_BS = (float)((float)(double)score_a + (float)score_b) + (float)score_c, while SCORE_BS2 = (float)(double)score_a + ((float)score_b + (float)score_c). Here, (float) means that we can only obtain the score through .score(), whose return type is float. The modification in this patch can only give score_a a temporary double value.

{quote} If there's only 1 required clause sent to BS/BNS can't we use its scorer instead? Have you explored having BS interact directly with all the MUST clauses, rather than using ConjunctionScorer? {quote}
Hmm. I don't think that would be helpful. The reason is the same as above.

{quote} Because we have wildly divergent results (sometimes one is much faster, other times it's much slower) we will somehow need to add logic to pick the right scorer for each query. But we can defer this until we're doneish iterating the changes to each scorer... it can come later on. {quote}
Yes, I agree.
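The effect of summation order on float scores can be reproduced in isolation. The sketch below is only an illustration: the class and method names are hypothetical, and the input values are deliberately extreme so that the rounding is easy to see. Real scores differ far less.

```java
class ScoreOrderSketch {
    // BS-style order: required score first, then each optional score in turn.
    static float sumOneByOne(float a, float b, float c) {
        return (a + b) + c;
    }

    // BS2-style order: the optional scores are summed first (and rounded to
    // float), then added to the required score.
    static float sumOptionalFirst(float a, float b, float c) {
        float optional = b + c;
        return a + optional;
    }

    public static void main(String[] args) {
        float a = 16777216f;   // 2^24: above this, neighbouring floats are 2 apart
        float b = 1f;
        float c = 1f;
        System.out.println(sumOneByOne(a, b, c));
        System.out.println(sumOptionalFirst(a, b, c));
    }
}
```

In the first order each `+ 1` is lost to rounding, while in the second the two ones are combined into an exactly representable `2` before meeting the large value, so the two results differ even though the inputs are identical.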
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018417#comment-14018417 ]

Da Huang commented on LUCENE-4396:

About the score difference between BS and BS2 (the same applies to BNS and BS2):

Currently the scores of BS and BS2 differ when executing a query like +a b c d. I have been told that the reason is indicated by the TODO in ReqOptSumScorer.score(), which says

{code}
// TODO: sum into a double and cast to float if we ever send required clauses to BS1
{code}

However, I don't think so, as the score deviation is due to the different orders in which the scores are summed. Suppose that a doc hits the query +a b c d. The score calculated by BS is

{code}
BS.score(doc) = ((a.score() + b.score()) + c.score()) + d.score()
{code}

while the score calculated by BS2 is

{code}
BS2.score(doc) = a.score() + (float)(b.score() + c.score() + d.score())
{code}

Notice that, in BS2, we can only get the float value of (b.score() + c.score() + d.score()), via optScorer.score().

Furthermore, I have noticed that we can actually control the summation order in BS, so that

{code}
BS.score(doc) = a.score() + ((b.score() + c.score()) + d.score())
{code}

However, for BS2, we do not know the summation order of (b.score() + c.score() + d.score()), as the order is determined by each scorer's position in a heap. I still think this matters little. I will rearrange the summation order of BS.score() in the next patch to see whether it works.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:
    Attachment: LUCENE-4396.patch
                And.tasks

A patch based on lucene github mirror commit cf10341825ff6bd1662dd48c51926bc51d751ce5.

I use a bitset to skip required docs when scanning optional and prohibited docs. The perf. comparison is at the bottom.

Besides, I built a new tasks file to test the perf., and I discovered that BNS speeds up the +a -b -c -d ... case a lot when b c d ... hit many docs.

{code}
BNS (without bitset) vs. BS2
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndTonsLowNot            4.29   (2.9%)            1.08   (0.6%)   -74.8% ( -76% - -73%)
HighAndTonsLowOr             4.87   (6.4%)            1.24   (1.0%)   -74.4% ( -76% - -71%)
HighAndSomeLowNot            9.03   (5.2%)            4.11   (4.1%)   -54.4% ( -60% - -47%)
HighAndSomeLowOr            16.21   (9.6%)            7.75   (4.1%)   -52.2% ( -60% - -42%)
LowAndSomeLowOr            303.28   (2.4%)          183.14   (6.6%)   -39.6% ( -47% - -31%)
LowAndSomeLowNot           257.24   (1.8%)          157.07   (6.5%)   -38.9% ( -46% - -31%)
LowAndSomeHighOr            36.78   (1.9%)           33.74   (3.0%)    -8.3% ( -12% -  -3%)
LowAndTonsLowNot            21.28   (2.0%)           19.69   (6.9%)    -7.5% ( -16% -   1%)
LowAndSomeHighNot           34.40   (1.6%)           33.69   (3.2%)    -2.1% (  -6% -   2%)
PKLookup                   100.63   (4.8%)          103.46   (4.7%)     2.8% (  -6% -  12%)
LowAndTonsHighOr             1.26   (1.6%)            1.41   (1.7%)    11.8% (   8% -  15%)
LowAndTonsLowOr             13.66   (0.9%)           15.50   (6.0%)    13.5% (   6% -  20%)
HighAndSomeHighNot           2.65   (1.4%)            3.12   (6.5%)    17.6% (   9% -  25%)
HighAndSomeHighOr            2.21   (2.4%)            2.62   (5.8%)    18.6% (  10% -  27%)
HighAndTonsHighOr            0.07   (0.8%)            0.19  (10.5%)   160.3% ( 147% - 172%)
LowAndTonsHighNot            2.86   (1.6%)           10.24  (18.1%)   257.7% ( 234% - 281%)
HighAndTonsHighNot           0.05   (0.8%)            0.40  (28.2%)   641.8% ( 607% - 676%)

BS vs. BS2
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndTonsLowOr             4.02   (6.8%)            0.87   (0.5%)   -78.2% ( -80% - -76%)
HighAndTonsLowNot            4.95   (3.4%)            1.29   (0.9%)   -73.9% ( -75% - -72%)
HighAndSomeLowOr            14.45   (9.5%)            6.68   (3.7%)   -53.8% ( -61% - -44%)
HighAndSomeLowNot           14.78   (5.1%)            7.48   (3.9%)   -49.4% ( -55% - -42%)
LowAndSomeLowOr            316.55   (2.2%)          170.14   (5.6%)   -46.3% ( -52% - -39%)
LowAndSomeLowNot           283.47   (1.7%)          157.35   (6.0%)   -44.5% ( -51% - -37%)
LowAndSomeHighOr            39.39   (2.0%)           35.07   (3.1%)   -11.0% ( -15% -  -6%)
LowAndSomeHighNot           53.96   (2.0%)           48.57   (3.8%)   -10.0% ( -15% -  -4%)
LowAndTonsLowNot            17.97   (1.5%)           17.04   (6.0%)    -5.2% ( -12% -   2%)
PKLookup                    97.57   (2.7%)          100.21   (5.2%)     2.7% (  -5% -  10%)
LowAndTonsHighOr             3.59   (1.7%)            3.74   (2.4%)     4.1% (   0% -   8%)
LowAndTonsLowOr             14.71   (1.3%)           15.63   (5.7%)     6.3% (   0% -  13%)
HighAndSomeHighNot           1.84   (1.3%)            2.05   (5.6%)    11.2% (   4% -  18%)
HighAndSomeHighOr            1.93   (2.1%)            2.16   (5.6%)    11.9% (   4% -  20%)
HighAndTonsHighOr            0.05   (1.0%)            0.13  (14.1%)   144.8% ( 128% - 161%)
LowAndTonsHighNot            1.63   (1.9%)            4.95   (7.2%)   204.0% ( 191% - 217%)
HighAndTonsHighNot           0.06   (1.0%)            0.34  (18.2%)   459.6% ( 435% - 483%)

BNS (with bitset) vs. BS2
Task                 QPS baseline   StdDev  QPS my_version   StdDev   Pct diff
HighAndSomeLowOr             7.45  (12.0%)            3.49   (6.6%)   -53.1% ( -64% - -39%)
HighAndSomeLowNot           10.45   (8.0%)            5.25   (6.8%)   -49.7% ( -59% - -37%)
LowAndSomeLowOr            310.53   (2.3%)          168.56   (5.8%)   -45.7% ( -52% - -38%)
LowAndSomeLowNot           292.05   (2.3%)          165.88   (5.7%)   -43.2% ( -50% - -36%)
HighAndTonsLowNot            5.94   (3.5%)            4.33   (6.8%)   -27.0% ( -36% - -17%)
HighAndTonsLowOr             5.92   (4.4%)            4.39   (6.0%)   -25.9% ( -34% - -16%)
LowAndSomeHighNot           53.79   (2.4%)           47.71   (2.8%)   -11.3
{code}
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016491#comment-14016491 ]

Da Huang edited comment on LUCENE-4396 at 6/3/14 1:24 PM
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008299#comment-14008299 ]

Da Huang edited comment on LUCENE-4396 at 5/25/14 8:44 AM:

The patch is based on lucene github mirror commit cfb408ff6788e6fea8215098a785d72fb4e95c5b. The following things have been done:

1. Rename TestBooleanNovelScorer to TestBooleanUnevenly. This test suite tests both BNS and BS when the hit documents are unevenly distributed.
2. Following Robert's advice, I sum scores into a double and cast to float in ConjunctionScorer. However, it seems to have little effect. The score difference problem still remains.
3. Add a comment on accepting score differences within a tolerance in luceneutil.
4. Make a new tasks file, which can test AndSomeOR cases.
5. Run luceneutil for BNS vs BS2 and BS vs BS2. The results are shown below.

P.S. BS has the same score difference problem as BNS. Although there is no BS2 now, as the architecture has changed, I still call it BS2 here for convenience.

{code}
BNS vs BS2
Task                 QPS baseline   StdDev  QPS my_modified_version   StdDev   Pct diff
HighAndTonsLowOr            10.95   (3.5%)                     1.52   (0.3%)   -86.1% ( -86% - -85%)
HighAndSomeLowOr            29.98   (6.7%)                    11.84   (2.9%)   -60.5% ( -65% - -54%)
LowAndSomeLowOr            756.81   (1.4%)                   503.21   (2.8%)   -33.5% ( -37% - -29%)
LowAndSomeHighOr            54.25   (2.1%)                    53.26   (2.1%)    -1.8% (  -5% -   2%)
PKLookup                   241.74   (2.8%)                   241.96   (2.3%)     0.1% (  -4% -   5%)
LowAndTonsLowOr             40.23   (1.2%)                    43.19   (7.2%)     7.4% (   0% -  15%)
LowAndTonsHighOr             2.63   (2.1%)                     2.99   (2.3%)    13.8% (   9% -  18%)
HighAndSomeHighOr            4.99   (1.8%)                     5.86   (4.7%)    17.4% (  10% -  24%)
HighAndTonsHighOr            0.09   (1.5%)                     0.22   (8.1%)   145.4% ( 133% - 157%)

BS vs BS2
Task                 QPS baseline   StdDev  QPS my_modified_version   StdDev   Pct diff
HighAndTonsLowOr            16.54   (2.4%)                     3.70   (0.2%)   -77.6% ( -78% - -76%)
HighAndSomeLowOr            11.95   (8.5%)                     4.29   (0.8%)   -64.1% ( -67% - -59%)
LowAndSomeLowOr            839.11   (1.9%)                   540.83   (2.5%)   -35.5% ( -39% - -31%)
LowAndSomeHighOr           149.50   (2.6%)                   136.71   (3.4%)    -8.6% ( -14% -  -2%)
HighAndSomeHighOr            3.72   (1.7%)                     3.51   (1.7%)    -5.6% (  -8% -  -2%)
PKLookup                   240.32   (2.8%)                   238.87   (2.8%)    -0.6% (  -6% -   5%)
LowAndTonsHighOr             4.96   (2.3%)                     5.35   (3.8%)     7.8% (   1% -  14%)
LowAndTonsLowOr             35.28   (1.2%)                    39.00   (5.2%)    10.6% (   4% -  17%)
HighAndTonsHighOr            0.16   (1.1%)                     0.36   (4.0%)   122.6% ( 116% - 129%)
{code}
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001595#comment-14001595 ]

Da Huang commented on LUCENE-4396:

Thanks for your reply.

{quote}OK, maybe add a comment just saying something temporarily commented out so NovelBS is invoked instead of BS?{quote}
I will add that comment.

{quote} And what exception did luceneutil throw...?{quote}
It just says "hit %s has wrong field/score value %s vs %s", and the perf. test aborts. The score difference is about 0.01.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001706#comment-14001706 ]

Da Huang commented on LUCENE-4396:

{quote} I'm nervous about the luceneutil change, just because I don't want to encourage complacency on scores being different in general. {quote}
I agree. But it seems that the original perf. tasks file has too few items in each case to expose score differences when the scorers sum in different orders. In fact, if I reduce the number of items in each case of my tasks file to 3, the scores are the same as on trunk.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002604#comment-14002604 ]

Da Huang commented on LUCENE-4396:

Thanks for your advice, Robert. Do you mean just changing

{code}
float sum = 0.0f;
{code}

to

{code}
double sum = 0.0f;
{code}

? However, I'm not sure that this alone will be enough to remove the scoring differences, as the differences are due to the different summation order.
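A small sketch can show both what accumulating in a double buys and where it stops helping. It is an illustration under assumptions, not the ConjunctionScorer change itself: the class and method names are hypothetical, and the values are chosen only to make the rounding visible.

```java
class DoubleSumSketch {
    // Accumulate float scores in a double and cast once at the end.
    static float sumInDouble(float... scores) {
        double sum = 0.0;
        for (float s : scores) {
            sum += s;
        }
        return (float) sum;
    }

    // Outer sum in double, but the inner sum of b and c has already been
    // rounded to float, as happens when a sub-scorer's score() returns float.
    static float nestedSum(float a, float b, float c) {
        float inner = b + c;
        return (float) ((double) a + (double) inner);
    }
}
```

With inputs such as `1f, 16777216f, 1f`, `sumInDouble` keeps both small terms, whereas `nestedSum` has already lost one of them inside the float-valued inner sum before the double accumulation begins. This matches the concern above that switching one `sum` variable to double cannot remove differences that come from the nesting order.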
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002620#comment-14002620 ]

Da Huang commented on LUCENE-4396:

Oh, thanks. I think it's worth a try.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998736#comment-13998736 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions!

{quote}
maybe we could test on fewer terms, for the Low/HighAndManyLow/High tasks? I think it's more common to have a handful (3-5 maybe) of terms.
{quote}
When there are few terms, BooleanNovelScorer is slower than BS (about -10%). However, I have to generate tasks with fewer terms and rerun them to confirm the exact perf. difference.

{quote}
But maybe keep your current category and rename it to Tons instead of Many?
{quote}
OK, I will do so.

{quote}
Maybe we can improve the test so that it exercises BS and NBS? E.g., toggle the require docs in order via a custom collector?
{quote}
Yes, I think that's a good idea.

{quote}
Hmm do we know why the scores changed?
{quote}
Yes, it's because the calculation orders are different. BS adds up the scores of all SHOULD clauses, and then adds their sum to the final score. BNS adds the score of each SHOULD clause to the final score one by one.

{quote}
Are we comparing BS2 to NovelBS?
{quote}
Yes.

{quote}
I think BS and BS2 already have different scores today?
{quote}
Yes. Actually, the score calculation order of BS is the same as that of BNS.

{quote}
but you commented this out in your patch in order to test NBS I guess?
{quote}
Yes, I did that in order to test BNS. Otherwise, luceneutil would throw an exception.

{quote}
Do you have any perf results of BS w/ required clauses (as a BulkScorer) vs BS2 (what trunk does today)?
{quote}
Hmm, I haven't carried out such an experiment yet. Checking the perf. results of BS vs BS2 is a good idea. I will do that. :)
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

Add TestBooleanNovelScorer.java to detect the bug in the second patch.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

The patch is based on the github mirror commit c1e423e45e6fa9f846ab2c382c0100fd515b4cb1.

The following things are done in this patch:
1. Fix the bug in the last patch. The bug was caused by not setting prev and next to null before adding an element to a linked list.
2. Refine the code style.
3. Make a small improvement to .advance().

The performance is a little better than with the last patch, but still worse than trunk when testing with luceneutil.

P.S. The bug in the last patch cannot be detected by ant-test, but can be found by running a query like +a b on luceneutil. I am going to add a JUnit test case that can detect the bug, but it may take me some days.
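The kind of bug described in item 1 of the patch notes (stale prev and next links on a reused list element) can be sketched in isolation. The following Java class is a hypothetical illustration, not the code from the patch: the names BucketListSketch and Bucket, and the idea that buckets are reused after the list is cleared, are assumptions made for the example.

```java
// Hypothetical sketch; not taken from LUCENE-4396.patch.
final class BucketListSketch {

    static final class Bucket {
        final int doc;
        Bucket prev;
        Bucket next;

        Bucket(int doc) {
            this.doc = doc;
        }
    }

    private Bucket head;
    private Bucket tail;

    // Appends a (possibly reused) bucket to the end of the list.
    void append(Bucket b) {
        // Reset links first: a reused bucket may still point into an old list.
        b.prev = null;
        b.next = null;
        if (tail == null) {
            head = b;
            tail = b;
        } else {
            tail.next = b;
            b.prev = tail;
            tail = b;
        }
    }

    // Forgets the current list without touching the buckets themselves.
    void clear() {
        head = null;
        tail = null;
    }

    // Counts the buckets reachable from head by following next links.
    int size() {
        int n = 0;
        for (Bucket b = head; b != null; b = b.next) {
            n++;
        }
        return n;
    }
}
```

If the two reset lines in append are removed, a bucket that was linked to others in a previous round drags those old neighbours back into the new list, which is the sort of failure that only shows up once elements are reused.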
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: AndOr.tasks

luceneutil tasks file to test queries like +a b c d e ...

The performance is as follows.
|| Task || QPS baseline || StdDev || QPS my_modified_version || StdDev || Pct diff ||
| HighAndManyLowOr | 8.50 | (3.3%) | 1.72 | (0.3%) | -79.8% ( -80% - -78%) |
| PKLookup | 239.75 | (0.9%) | 239.99 | (0.9%) | 0.1% ( -1% - 1%) |
| LowAndManyHighOr | 7.11 | (1.4%) | 7.76 | (1.4%) | 9.1% ( 6% - 12%) |
| LowAndManyLowOr | 33.83 | (0.7%) | 41.03 | (2.7%) | 21.3% ( 17% - 24%) |
| HighAndManyHighOr | 0.12 | (0.7%) | 0.29 | (7.8%) | 148.0% ( 138% - 157%) |
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: luceneutil-score-equal.patch

A patch for luceneutil, which allows scores to differ within a tolerance range.
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994363#comment-13994363 ]

Da Huang edited comment on LUCENE-4396 at 5/11/14 12:56 AM:

luceneutil tasks file to test queries like +a b c d e ...

The performance is as follows.
|| Task || QPS baseline || StdDev || QPS my_modified_version || StdDev || Pct diff ||
| HighAndManyLowOr | 8.50 | (3.3%) | 1.72 | (0.3%) | -79.8% ( -80% - -78%) |
| PKLookup | 239.75 | (0.9%) | 239.99 | (0.9%) | 0.1% ( -1% - 1%) |
| LowAndManyHighOr | 7.11 | (1.4%) | 7.76 | (1.4%) | 9.1% ( 6% - 12%) |
| LowAndManyLowOr | 33.83 | (0.7%) | 41.03 | (2.7%) | 21.3% ( 17% - 24%) |
| HighAndManyHighOr | 0.12 | (0.7%) | 0.29 | (7.8%) | 148.0% ( 138% - 157%) |
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985254#comment-13985254 ]

Da Huang commented on LUCENE-4396:

Thanks for your suggestions, Mike. And sorry for my late reply.

{quote}
Hmm, the patch didn't cleanly apply, but I was able to work through it. I think your dev area is not up to date with trunk?
{quote}
I haven't merged my branch with the newest trunk version, because my network account at school ran out for April and I couldn't pull the code from github until 1 May. Sorry for that.

{quote}
Small code style things
{quote}
I'm very sorry for the code style. That's my fault.

{quote}
So it looks like BooleanNovelScorer is able to be a Scorer because the linked-list of visited buckets in one window are guaranteed to be in docID order, because we first visit the requiredConjunctionScorer's docs in that window.
{quote}
Yes, you're right.

{quote}
Have you tested performance when the .advance method here isn't called? Ie, just boolean queries w/ one MUST and one or more SHOULD?
{quote}
No, I haven't. Do you mean the .advance method of the subScorers in BooleanNovelScorer? If so, I will do that. If you mean the .advance method of BooleanNovelScorer itself, I think it would be confusing, because BooleanNovelScorer is now used whenever there is at least one MUST clause, whether or not it acts as a top scorer. Therefore, .advance() of BooleanNovelScorer must be called when BooleanNovelScorer acts as a non-top scorer.

{quote}
I think the important question here is whether/in what cases the BooleanNovelScorer approach beats BooleanScorer2 performance?
{quote}
Yes, you're right. But BooleanNovelScorer is not completely finished yet, and its performance remains to be improved, especially in its .advance method.

{quote}
I realized LUCENE-4872 is related here, i.e. we should also sometimes use BooleanScorer for the minShouldMatch1 case.
{quote}
Yes, I noticed that too. :) I think the two issues should be dealt with together.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

I created a new class named BooleanNovelScorer in this iteration. This scorer is based on the technique of BooleanScorer, but can make use of the skipping list while collecting documents. Moreover, it is a subclass of Scorer, so it can act as a non-top scorer. However, its performance is low for now, because I have not implemented its .advance() in an efficient way.
[jira] [Updated] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Da Huang updated LUCENE-4396:

Attachment: LUCENE-4396.patch

BooleanScorer can now support MUST clauses (i.e. requiredScorers). The patch is based on commit 9e87821edeb3e24ca8dedaecf856f6510d61d0d3 on github.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969418#comment-13969418 ]

Da Huang commented on LUCENE-4396:

Currently, I pass List<Scorer> requiredScorers to BooleanScorer and merge them into a ConjunctionScorer. For consistency, I should probably change the argument from List<Scorer> requiredScorers to List<BulkScorer> requiredScorers, but then a getScorer method would have to be added to BulkScorer.

Besides, I removed the static modifier from BooleanScorerCollector and BucketTable, because I have to refer to the member requiredNrMatchers of BooleanScorer. But I'm not so sure whether removing the static modifier is the proper option.
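The trade-off behind the static question above can be shown with a minimal Java sketch. It is not the BooleanScorer source: ScorerSketch, InnerTable and StaticTable are hypothetical names, and only the field name requiredNrMatchers is borrowed from the comment. A non-static inner class can read the enclosing instance's field directly, while a static nested class needs the value handed to it, for example through its constructor (this second variant is an alternative suggested here for comparison, not something proposed in the thread).

```java
// Hypothetical sketch; these classes are not part of Lucene.
final class ScorerSketch {

    final int requiredNrMatchers;

    ScorerSketch(int requiredNrMatchers) {
        this.requiredNrMatchers = requiredNrMatchers;
    }

    // Non-static inner class: holds an implicit reference to the enclosing
    // ScorerSketch instance, so it can read requiredNrMatchers directly.
    final class InnerTable {
        boolean matches(int matchedRequired) {
            return matchedRequired >= requiredNrMatchers;
        }
    }

    // Static nested class: no enclosing instance, so the value must be
    // passed in explicitly and stored in its own field.
    static final class StaticTable {
        private final int requiredNrMatchers;

        StaticTable(int requiredNrMatchers) {
            this.requiredNrMatchers = requiredNrMatchers;
        }

        boolean matches(int matchedRequired) {
            return matchedRequired >= requiredNrMatchers;
        }
    }
}
```

Usage differs only at construction time: `new ScorerSketch(2).new InnerTable()` versus `new ScorerSketch.StaticTable(2)`. The inner-class form is shorter but ties each table to one outer instance; the static form keeps the nested class independent at the cost of one extra field.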
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970193#comment-13970193 ]

Da Huang commented on LUCENE-4396:

These suggestions are very helpful. Thank you. :)

Adding a mustClauseCountMatches counter would be inefficient, as it cannot make use of list skipping.

How about implementing getScorer to return null in BulkScorer, while returning the scorer in DefaultBulkScorer? I'm not very sure whether passing List<BulkScorer> instead of List<Scorer> is really necessary, so I think this issue should probably be set aside for now.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942978#comment-13942978 ]

Da Huang commented on LUCENE-4396:

Oh, that's very true. With the current design, using advance will hurt in some cases. However, I think such cases may be solvable by building a skipping list on MUST-all-hit bulks and skipping MUST-all-hit bulks when scanning SHOULD, but I haven't made this idea very clear in my mind. So making BooleanBulkScorer is still necessary for now.
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942978#comment-13942978 ]

Da Huang edited comment on LUCENE-4396 at 3/21/14 11:19 AM:

Oh, that's very true. With the current design, using advance will hurt in such a case. However, I think such a case may be solvable by building a skipping list on MUST-all-hit bulks and skipping MUST-all-hit bulks when scanning SHOULD, but I haven't made this idea very clear in my mind. So making BooleanBulkScorer is still necessary for now.

was (Author: dhuang):
Oh, that's very true. According to the current design, using advance will hurt is some cases. However, I think such cases may be able to be solved by building a skipping list on MUST-all-hit bulks and skipping MUST-all-hit bulks when scanning SHOULD, but I haven't made this idea very clear in my mind. So making BooleanBulkScorer is still necessary now.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941775#comment-13941775 ]

Da Huang commented on LUCENE-4396:

Sorry for my late reply. I have been thinking about the new code/design on trunk these days.

The new code breaks out BulkScorer from Scorer, and it is necessary to create a new BooleanScorer (a Scorer), just as you said. I'm afraid that we do have to take Scorers rather than BulkScorers as subScorers in the new BooleanScorer. And yes: BooleanBulkScorer should not be embedded, as its docIDs are out of order.

My idea is to keep BooleanBulkScorer supporting only the no-MUST-clause case, and let the new BooleanScorer deal with the case where there is at least one MUST clause. I think this is one of the best ways to stay compatible with the current design.

Besides, I'm afraid that the name BulkScorer may be confusing. The new BooleanScorer is also implemented by scoring a range of documents at once, but it can actually act as a sub-scorer.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942577#comment-13942577 ]

Da Huang commented on LUCENE-4396:

I'm afraid that if BooleanBulkScorer also handled MUST, it couldn't make use of .advance(), because its sub-scorers are BulkScorers, which do not support .advance().
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942684#comment-13942684 ]

Da Huang commented on LUCENE-4396:

A new iteration of the proposal has just been submitted. It adds a "Supplementary Notes" section describing how to fit my design to the new design on the current Lucene trunk, such as renaming BooleanScorer to BooleanBulkScorer and creating a new BooleanScorer that extends Scorer.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940342#comment-13940342 ]

Da Huang commented on LUCENE-4396:

Hi, Mike. I have just finished revising my proposal.

I'm not so sure about this part of the description on this page: "unless the MUST clauses have very low hit count compared to the other clauses, that BooleanScorer would perform better than BooleanScorer2."

In my opinion, even when the MUST clauses have a very low hit count compared to the other clauses, BooleanScorer is likely to perform better than BooleanScorer2, because calling .advance() on the SHOULD clauses can skip as many documents as BooleanScorer2 does. The relevant ideas are described in the section "Improve the Rule for Choosing Scorer" of the proposal.

As this is not quite consistent with the description on this page, I'm not sure whether my idea makes sense.
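The .advance() argument above can be illustrated with a small sketch. It is not Lucene code: ArrayDocIterator and MustDrivenScoring are hypothetical names, posting lists are modeled as sorted int arrays, and "scoring" is reduced to counting how many SHOULD clauses match each document that the MUST clause matches. The MUST clause drives the iteration, and each SHOULD clause is advanced straight to the current MUST document instead of being stepped one document at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a postings iterator over a sorted docID array. */
final class ArrayDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    ArrayDocIterator(int... docs) { this.docs = docs; }

    /** Current docID, -1 before the first call, NO_MORE_DOCS when exhausted. */
    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() { pos++; return docID(); }

    /** Moves to the first doc >= target (linear scan here; real postings use skip data). */
    int advance(int target) {
        if (pos < 0) pos = 0;
        while (pos < docs.length && docs[pos] < target) pos++;
        return docID();
    }
}

final class MustDrivenScoring {
    /** Returns "doc:matchedShouldCount" for every doc matching the MUST clause. */
    static List<String> score(ArrayDocIterator must, List<ArrayDocIterator> shoulds) {
        List<String> hits = new ArrayList<>();
        for (int doc = must.nextDoc(); doc != ArrayDocIterator.NO_MORE_DOCS; doc = must.nextDoc()) {
            int matched = 0;
            for (ArrayDocIterator s : shoulds) {
                // Skip the SHOULD clause directly to the MUST doc; docs in between can never match.
                if (s.docID() < doc) s.advance(doc);
                if (s.docID() == doc) matched++;
            }
            hits.add(doc + ":" + matched);
        }
        return hits;
    }
}
```

With a MUST clause matching only a few documents, the SHOULD iterators touch only the neighbourhood of those documents, which is the skipping behaviour the comment attributes to both scorers.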
[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937412#comment-13937412 ]

Da Huang edited comment on LUCENE-4396 at 3/17/14 2:14 AM:

I'm revising and polishing my proposal these days, and I have discovered an interesting thing: if BooleanScorer supports required scorers in the way I have proposed, docIDs would be in ascending order in the bucket table. I think this allows BooleanScorer to act as a non-top-level scorer (a "Not-Top Scorer"), because .advance(), .docID(), .nextDoc() etc. can be implemented.

However, I'm not sure how this would affect performance when it acts as a Not-Top Scorer. When .nextDoc() or .advance() is called, BooleanScorer may compute a whole 2K window, and not all of that window's data may be useful.

I hope I have made my idea clear.
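The concern about the 2K window can be made concrete with a rough sketch. It assumes a window of 2048 docIDs (the "2K" in the comment) and models only a single MUST clause as a sorted int array; WindowedScorerSketch and its bucketsFilled counter are hypothetical and are not part of Lucene. Each call to advance() that leaves the current window fills a whole new window, even if the caller later skips past most of it.

```java
import java.util.Arrays;

final class WindowedScorerSketch {
    static final int SIZE = 1 << 11;                 // 2048 docIDs per window
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;                        // sorted docIDs of the MUST clause
    private final boolean[] bucket = new boolean[SIZE];
    private int next = 0;                            // first doc not yet loaded into a window
    private int base = 0;                            // first docID covered by the current window
    private int slot = SIZE;                         // next slot to inspect; SIZE means consumed
    private int doc = -1;
    int bucketsFilled = 0;                           // total buckets ever filled (work done)

    WindowedScorerSketch(int... docs) { this.docs = docs; }

    /** Loads the window containing the first unloaded doc >= target; false if none is left. */
    private boolean fillWindow(int target) {
        while (next < docs.length && docs[next] < target) next++;
        if (next == docs.length) return false;
        base = docs[next] & ~(SIZE - 1);             // align the window start to a multiple of SIZE
        Arrays.fill(bucket, false);
        while (next < docs.length && docs[next] - base < SIZE) {
            bucket[docs[next] - base] = true;        // docIDs land in ascending slot order
            bucketsFilled++;
            next++;
        }
        slot = Math.max(0, target - base);
        return true;
    }

    int advance(int target) {
        while (true) {
            // Scan what is left of the current window, starting no earlier than target.
            for (int s = Math.max(slot, target - base); s < SIZE; s++) {
                if (bucket[s]) { slot = s + 1; return doc = base + s; }
            }
            slot = SIZE;
            if (!fillWindow(target)) return doc = NO_MORE_DOCS;
        }
    }

    int nextDoc() { return doc == NO_MORE_DOCS ? doc : advance(doc + 1); }
}
```

For the docIDs 3, 10, 2000, 5000, 5001, 9000, a caller that does nextDoc(), advance(4000), advance(9000) receives three documents while six buckets are filled, so half of the window work is thrown away in this toy case.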
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937412#comment-13937412 ]

Da Huang commented on LUCENE-4396:

I'm revising and polishing my proposal these days, and I have discovered an interesting thing: if BooleanScorer supports required scorers in the way I have proposed, docIDs would be in ascending order in the bucket table. I think this allows BooleanScorer to act as a non-top-level scorer (a "Not-Top Scorer"), because .advance(), .docID(), .nextDoc() etc. can be implemented.

However, I'm not sure how this would affect performance when it acts as a Not-Top Scorer. When .nextDoc() or .advance() is called, BooleanScorer may compute a whole 2K window, and not all of that window's data may be useful.

I hope I have made my idea clear.
[jira] [Commented] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses
[ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935929#comment-13935929 ]

Da Huang commented on LUCENE-4396:

I have just set the GSoC proposal's visibility to Public. The public URL is: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/dhuang/5629499534213120
Re: [GoSC] I'm interested in LUCENE-3333
Hi, Mike.

You're right. After having a look at the comments on LUCENE-1518, I find that my idea about it has many problems. Sorry for that. I have therefore checked some of the other suggestions you gave me, to see whether relevant comments can be found in JIRA.

I think I have an idea for LUCENE-4396: BooleanScorer should sometimes be used for MUST clauses. Can we adjust the query to make the problem easier? Taking the query "+a b c +e +f" as an example, maybe we can turn it into "(+a +e +f) b c", which has only one MUST clause. Then it would be easier to judge which scorer to use.

Besides, it seems that the suggestion "we should pass a needsScorers boolean up-front to Weight.scorer" is not in JIRA. It sounds like it could be done by adjusting the arguments and return values of some class methods to pass needsScorers through, but I'm not sure.

Lastly, I recently found something strange in the heap-related code. A heap has been implemented several times over in trunk, and a PriorityQueue is also implemented in the package org.apache.lucene.util. I remember that Java already provides a PriorityQueue. Why not use that?

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
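The query rewrite proposed in this mail can be sketched as follows. This is only an illustration over clause strings, not Lucene's BooleanQuery API: MustGrouping is a hypothetical helper, clauses are written "+x" for MUST and "x" for SHOULD, and MUST_NOT is ignored. The sketch also assumes that the group itself has to stay required, so it prints the result as "+(+a +e +f) b c".

```java
import java.util.ArrayList;
import java.util.List;

final class MustGrouping {
    /** Collects all MUST clauses into one required sub-query; SHOULD clauses keep their order. */
    static String groupMust(List<String> clauses) {
        List<String> must = new ArrayList<>();
        List<String> should = new ArrayList<>();
        for (String c : clauses) {
            if (c.startsWith("+")) must.add(c); else should.add(c);
        }
        // With zero or one MUST clause there is nothing to group.
        if (must.size() < 2) return String.join(" ", clauses);
        List<String> out = new ArrayList<>();
        out.add("+(" + String.join(" ", must) + ")");
        out.addAll(should);
        return String.join(" ", out);
    }
}
```

After the rewrite the top-level query has at most one MUST clause, so a rule for choosing the scorer only has to compare that single clause against the SHOULD clauses.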
Re: [GoSC] I'm interested in LUCENE-3333
Thanks a lot. That's very helpful.

I think you get exactly what I mean about LUCENE-4396. By grouping the MUST clauses together, the conjunctive part of the query can be handled separately in a simple way. Then the original query would have no more than one MUST clause. I think in this situation it's much easier to judge whether to use BooleanScorer or BooleanScorer2. :)

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
Re: [GoSC] I'm interested in LUCENE-3333
Hello, Mike.

I have spent some time considering the suggestions in your last mail. I find that I'm interested in the suggestion "Filter and Query should be more 'combined'".

In my opinion, to implement this suggestion, a new class FilterQuery, a subclass of Query, should be created. If FilterQuery is implemented, it can be the query element of a BooleanClause, and a BooleanQuery can then naturally add a Filter as a BooleanClause. I think one of the most important things is how to deal with scores, as a Filter does not contribute anything to the score.

The above is my intuitive idea about this suggestion. Do you think it makes sense? I hope I have made my idea clear.

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
[GoSC] I'm interested in LUCENE-3333
Hello, everyone,

My name is Da Huang. I'm studying for my master's degree in Computer Science at Peking University. I have been using Lucene for about half a year. It's so elegant that I hope to have a chance to contribute some code to it.

Therefore, I have been scanning the JIRA GSoC 2014 Ideas page about Lucene for several days. I find LUCENE-: Specialize DisjunctionScorer if all clauses are TermQueries more suitable for me to do. I have spent some time scanning the relevant code and the issue LUCENE-3328, which spun off LUCENE-. The following questions confuse me.

1) I have checked out the code from http://svn.apache.org/repos/asf/lucene/dev/trunk into lucene_trunk, but I couldn't find the relevant code of the fixed issue LUCENE-3328. It seems that the patch attached on the page is not on trunk. Why?

2) My intuitive idea for solving this issue is to write a class DisjunctionTermScorer that handles the case where all clauses are TermQueries, and then to decide whether to use DisjunctionTermScorer in the method 'scorer' of class BooleanQuery. Is this idea right?

Those are my questions about LUCENE-.

Besides, I would like to propose the following issue about the QueryParser. When we use QueryParser to parse a query string like "science AND (engineering AND technology)", the generated query is "+science (+engineering +technology)". I think searching would be more efficient if the final query were "+science +engineering +technology". My idea is to flatten cascaded ANDs and cascaded ORs. Do you agree?

I hope I have made my idea clear.

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
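The flattening idea for cascaded ANDs can be sketched like this. QNode is a hypothetical, minimal query tree, not the QueryParser's or BooleanQuery's real representation; it models only conjunctions of terms, and cascaded ORs would be handled the same way. Note that a flattened query is not guaranteed to produce the same scores as the nested one, so this only illustrates the structural rewrite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical query tree: either a term leaf or a conjunction of required children. */
final class QNode {
    final String term;        // non-null for a leaf
    final List<QNode> must;   // non-null for a conjunction

    private QNode(String term, List<QNode> must) { this.term = term; this.must = must; }

    static QNode term(String t) { return new QNode(t, null); }
    static QNode and(QNode... children) { return new QNode(null, Arrays.asList(children)); }

    /** Returns the required terms with every nested conjunction inlined into its parent. */
    List<String> flatten() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(QNode n, List<String> out) {
        if (n.term != null) { out.add("+" + n.term); return; }
        for (QNode child : n.must) collect(child, out);
    }
}
```

For example, the tree for "science AND (engineering AND technology)" flattens to the three required terms science, engineering and technology, in that order.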