[ https://issues.apache.org/jira/browse/LUCENE-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-7641: --------------------------------- Attachment: LUCENE-7641.patch Here is a patch that does not compute the inverse cost and folds in the two ideas that I mentioned above to make point count estimation more accurate. I also ran a benchmark with luceneutil10m: {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff Fuzzy1 190.46 (7.7%) 176.23 (12.6%) -7.5% ( -25% - 13%) Respell 277.90 (3.6%) 275.11 (3.0%) -1.0% ( -7% - 5%) MedSpanNear 150.45 (2.5%) 149.27 (3.1%) -0.8% ( -6% - 4%) Wildcard 112.61 (6.4%) 111.76 (5.9%) -0.8% ( -12% - 12%) OrNotHighHigh 60.90 (3.6%) 60.51 (4.3%) -0.6% ( -8% - 7%) AndHighLow 1168.30 (3.9%) 1160.93 (3.8%) -0.6% ( -8% - 7%) OrHighMed 41.35 (5.1%) 41.16 (5.2%) -0.4% ( -10% - 10%) OrHighNotMed 69.18 (3.9%) 68.94 (4.1%) -0.3% ( -8% - 8%) OrHighNotHigh 47.52 (3.8%) 47.39 (4.1%) -0.3% ( -7% - 7%) LowTerm 691.05 (4.1%) 689.34 (4.5%) -0.2% ( -8% - 8%) HighSloppyPhrase 20.35 (3.1%) 20.31 (3.1%) -0.2% ( -6% - 6%) HighSpanNear 27.30 (2.0%) 27.25 (1.8%) -0.2% ( -3% - 3%) LowPhrase 101.02 (2.2%) 100.92 (2.8%) -0.1% ( -4% - 4%) Fuzzy2 68.31 (5.8%) 68.25 (7.1%) -0.1% ( -12% - 13%) LowSpanNear 170.99 (2.7%) 170.87 (3.4%) -0.1% ( -6% - 6%) LowSloppyPhrase 55.34 (1.9%) 55.30 (2.1%) -0.1% ( -4% - 4%) OrHighLow 84.81 (3.8%) 84.81 (4.1%) 0.0% ( -7% - 8%) AndHighMed 294.21 (2.3%) 294.28 (2.5%) 0.0% ( -4% - 4%) Prefix3 32.27 (8.8%) 32.28 (8.0%) 0.0% ( -15% - 18%) HighTermDayOfYearSort 65.88 (2.4%) 65.92 (2.8%) 0.1% ( -5% - 5%) HighPhrase 16.67 (2.1%) 16.70 (1.8%) 0.1% ( -3% - 4%) MedPhrase 82.73 (3.2%) 82.86 (3.0%) 0.2% ( -5% - 6%) MedSloppyPhrase 42.22 (2.1%) 42.29 (2.3%) 0.2% ( -4% - 4%) OrNotHighLow 889.62 (3.3%) 891.74 (4.9%) 0.2% ( -7% - 8%) OrHighHigh 32.35 (5.2%) 32.43 (5.2%) 0.3% ( -9% - 11%) OrNotHighMed 158.41 (3.3%) 158.83 (3.6%) 0.3% ( -6% - 7%) AndHighHigh 47.84 (1.4%) 47.98 (1.8%) 0.3% ( -2% - 3%) MedTerm 240.59 (4.1%) 241.51 (3.6%) 0.4% ( -7% - 8%) OrHighNotLow 111.19 (4.7%) 111.65 (5.3%) 0.4% ( -9% - 10%) HighTerm 100.65 (4.3%) 101.63 (4.6%) 1.0% ( -7% - 10%) HighTermMonthSort 140.07 (11.1%) 141.91 (10.5%) 1.3% ( -18% - 25%) IntNRQ10 76.64 (9.9%) 80.71 (2.8%) 5.3% ( -6% - 19%) IntNRQ50 17.16 (11.4%) 18.11 (2.7%) 5.6% ( -7% - 22%) IntNRQ25 33.52 (11.0%) 35.61 (3.0%) 6.2% ( -6% - 22%) IntNRQ 12.62 (11.8%) 15.84 (4.4%) 25.6% ( 8% - 47%) IntNRQ75 11.57 (11.7%) 15.18 (4.7%) 31.2% ( 13% - 54%) IntNRQ90 9.71 (11.6%) 13.84 (5.6%) 42.6% ( 22% - 67%) {noformat} This looks like a good speedup to me. The ranges that are followed by a number are some additional ranges that I added in order to see the impact based on the percentage of matched documents. For instance {{IntNRQ75}} has ranges that match 75% of documents. Here are the tasks that I appended to {{wikimedium.10M.nostopwords.tasks}} if you would like to reproduce: {noformat} IntNRQ10: nrq//timesecnum 6207 15563 IntNRQ10: nrq//timesecnum 53 8742 IntNRQ10: nrq//timesecnum 77160 84811 IntNRQ25: nrq//timesecnum 8899 36335 IntNRQ25: nrq//timesecnum 68109 86101 IntNRQ25: nrq//timesecnum 44771 64123 IntNRQ50: nrq//timesecnum 101 51223 IntNRQ50: nrq//timesecnum 25811 69953 IntNRQ50: nrq//timesecnum 46298 82710 IntNRQ75: nrq//timesecnum 13867 79886 IntNRQ75: nrq//timesecnum 8830 75922 IntNRQ75: nrq//timesecnum 17003 82111 IntNRQ90: nrq//timesecnum 1810 80159 IntNRQ90: nrq//timesecnum 6160 84811 IntNRQ90: nrq//timesecnum 8210 86330 {noformat} It is interesting that small ranges seem to also benefit from a small speedup (I reran the benchmark to confirm and it looks reproducible). I am unsure why, but suspect that the fact that we use a different code path for large ranges makes things easier for the JVM to optimize on small ranges (?). > Speed up point ranges that match most documents > ----------------------------------------------- > > Key: LUCENE-7641 > URL: https://issues.apache.org/jira/browse/LUCENE-7641 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Assignee: Adrien Grand > Priority: Minor > Attachments: LUCENE-7461.patch, LUCENE-7641.patch > > > If a point range matches most documents and every document has exactly one > value, then we could make things faster by computing the set of documents > that do NOT match the range instead. > It was not possible until recently since figuring out whether a range query > matches most documents was not possible, but we can now use the new > {{PointValues.estimatePointcount}} API to do that: we could just check > whether the cost of the inverse visitor is lower than the cost of the regular > range visitor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org