zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1475711031
Hi @jpountz, I was able to create a sorted index with a new low-cardinality field `quarter` and run some new benchmark tasks like the ones below. The new tasks show a substantial improvement, with a slowdown of around 5-7% on some existing tasks (e.g. `OrNotHighHigh`, `OrHighNotHigh`):

```
AndHighNotQuarter: +last -quarter:q1 # freq=830278
AndHighNotQuarter: +united -quarter:q2 # freq=1185528
AndHighNotQuarter: +year -quarter:q3 # freq=1098425
AndHighNotQuarter: +its -quarter:q4 # freq=1160703
AndHighNotQuarter: +but -quarter:q1 # freq=1484398
AndMedNotQuarter: +mostly -quarter:q2 # freq=89401
AndMedNotQuarter: +interview -quarter:q3 # freq=94736
AndMedNotQuarter: +9 -quarter:q4 # freq=541405
AndMedNotQuarter: +hard -quarter:q1 # freq=92045
AndMedNotQuarter: +bay -quarter:q2 # freq=117167
```

```
Task                      QPS baseline   StdDev  QPS my_modified_version   StdDev Pct diff                 p-value
OrNotHighHigh                   574.77   (3.1%)                   535.14   (3.2%)    -6.9% ( -12% -   0%)    0.000
OrHighNotHigh                   423.02   (3.6%)                   396.20   (3.6%)    -6.3% ( -13% -   0%)    0.000
OrHighNotMed                    746.53   (3.7%)                   702.90   (4.5%)    -5.8% ( -13% -   2%)    0.000
OrHighNotLow                    750.04   (4.4%)                   712.78   (4.5%)    -5.0% ( -13% -   4%)    0.000
OrNotHighMed                    653.88   (3.0%)                   624.44   (3.9%)    -4.5% ( -11% -   2%)    0.000
OrNotHighLow                   1205.35   (5.4%)                  1185.91   (5.8%)    -1.6% ( -12% -  10%)    0.363
PKLookup                        269.07   (2.4%)                   266.45   (3.0%)    -1.0% (  -6% -   4%)    0.259
AndHighMed                      146.24   (6.4%)                   144.93   (6.1%)    -0.9% ( -12% -  12%)    0.649
Wildcard                        220.71   (3.3%)                   218.75   (2.7%)    -0.9% (  -6% -   5%)    0.352
HighTermTitleSort               148.58   (3.3%)                   147.26   (2.7%)    -0.9% (  -6% -   5%)    0.347
HighTermTitleBDVSort             35.47   (2.5%)                    35.18   (2.0%)    -0.8% (  -5% -   3%)    0.247
BrowseMonthTaxoFacets            35.20  (35.1%)                    34.93  (38.6%)    -0.8% ( -55% - 112%)    0.948
HighTerm                        714.46   (4.1%)                   710.25   (3.5%)    -0.6% (  -7% -   7%)    0.626
MedTerm                        1063.86   (4.3%)                  1058.95   (3.4%)    -0.5% (  -7% -   7%)    0.707
OrHighMed                       192.38   (3.0%)                   191.49   (2.6%)    -0.5% (  -5% -   5%)    0.602
Prefix3                         259.99   (5.9%)                   258.85   (6.3%)    -0.4% ( -11% -  12%)    0.821
OrHighHigh                       39.21   (2.9%)                    39.06   (2.2%)    -0.4% (  -5% -   4%)    0.639
LowPhrase                        65.71   (2.5%)                    65.53   (2.6%)    -0.3% (  -5% -   4%)    0.723
Respell                         105.97   (2.6%)                   105.72   (1.7%)    -0.2% (  -4% -   4%)    0.740
MedTermDayTaxoFacets             38.03   (2.7%)                    37.98   (2.2%)    -0.1% (  -4% -   4%)    0.869
HighSpanNear                     10.67   (2.5%)                    10.66   (2.3%)    -0.1% (  -4% -   4%)    0.865
MedPhrase                        97.20   (3.3%)                    97.11   (2.8%)    -0.1% (  -6% -   6%)    0.931
AndHighLow                     1597.01   (5.2%)                  1596.89   (4.1%)    -0.0% (  -8% -   9%)    0.996
AndHighHigh                      64.16   (4.5%)                    64.17   (4.0%)     0.0% (  -8% -   8%)    0.995
AndHighMedDayTaxoFacets          91.65   (1.6%)                    91.73   (2.4%)     0.1% (  -3% -   4%)    0.894
Fuzzy2                          101.73   (2.5%)                   101.84   (2.5%)     0.1% (  -4% -   5%)    0.889
AndHighHighDayTaxoFacets          6.90   (3.0%)                     6.91   (2.3%)     0.1% (  -4% -   5%)    0.890
HighSloppyPhrase                 39.46   (3.9%)                    39.54   (3.2%)     0.2% (  -6% -   7%)    0.870
OrHighMedDayTaxoFacets           22.70   (4.9%)                    22.74   (5.7%)     0.2% (  -9% -  11%)    0.899
MedSpanNear                     117.54   (2.8%)                   117.84   (1.9%)     0.3% (  -4% -   5%)    0.735
Fuzzy1                          189.31   (2.7%)                   189.83   (2.8%)     0.3% (  -5% -   5%)    0.753
HighPhrase                       92.14   (2.9%)                    92.41   (2.5%)     0.3% (  -4% -   5%)    0.731
LowSpanNear                     111.85   (2.1%)                   112.29   (2.6%)     0.4% (  -4% -   5%)    0.594
OrHighLow                       175.33   (4.6%)                   176.33   (4.0%)     0.6% (  -7% -   9%)    0.675
LowSloppyPhrase                  60.26   (3.4%)                    60.65   (3.0%)     0.7% (  -5% -   7%)    0.522
MedSloppyPhrase                 102.93   (3.3%)                   103.90   (3.2%)     0.9% (  -5% -   7%)    0.362
MedIntervalsOrdered              18.60   (5.5%)                    18.82   (6.5%)     1.2% ( -10% -  13%)    0.542
HighIntervalsOrdered              5.94   (8.4%)                     6.01   (8.9%)     1.2% ( -14% -  20%)    0.671
IntNRQ                          183.09   (7.7%)                   185.85   (5.7%)     1.5% ( -11% -  16%)    0.482
HighTermMonthSort              3709.41   (5.6%)                  3771.76   (6.9%)     1.7% ( -10% -  15%)    0.400
HighTermDayOfYearSort           492.50   (5.3%)                   501.64   (3.4%)     1.9% (  -6% -  11%)    0.185
LowIntervalsOrdered             252.81   (8.1%)                   258.30   (9.8%)     2.2% ( -14% -  21%)    0.446
LowTerm                        1057.85   (6.4%)                  1081.16   (8.1%)     2.2% ( -11% -  17%)    0.342
TermDTSort                      253.74   (5.9%)                   259.66   (4.5%)     2.3% (  -7% -  13%)    0.161
BrowseDateSSDVFacets              5.08  (21.2%)                     5.28  (24.8%)     3.9% ( -34% -  63%)    0.589
BrowseDayOfYearSSDVFacets        26.14  (25.7%)                    27.21  (34.3%)     4.1% ( -44% -  86%)    0.670
BrowseMonthSSDVFacets            26.25  (23.4%)                    27.98  (30.7%)     6.6% ( -38% -  79%)    0.446
BrowseDayOfYearTaxoFacets        39.48  (35.1%)                    42.26  (34.3%)     7.0% ( -46% - 117%)    0.522
BrowseDateTaxoFacets             39.49  (35.1%)                    42.32  (34.3%)     7.2% ( -46% - 117%)    0.513
AndHighNotQuarter               147.80   (1.7%)                   389.73   (9.5%)   163.7% ( 150% - 177%)    0.000
AndMedNotQuarter                112.31   (1.1%)                   430.97  (13.3%)   283.7% ( 266% - 301%)    0.000
```

I feel the results so far look promising. Regarding the question I raised earlier:

> Lucene does a lot of two-phase iteration, and a two-phase iterator's approximation may provide a superset of the actual matches. If we were to use this API to find and ignore / skip over a bunch of doc IDs from the approximation, wouldn't the result be inaccurate?

Maybe one "solution" could be to mark this API as expert-only and warn that any concrete implementation should provide an exact range rather than an approximated range?
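
To make the concern about approximations concrete, here is a toy sketch in plain Java. It does not use any Lucene classes and is not the API proposed in this PR: `RunSkipSketch`, `excludeRun`, and `excludeExact` are hypothetical names, and the doc IDs are made up. It only models an exclusion clause that reports a run of matching doc IDs, and compares skipping an exact run with skipping a run taken from an over-wide approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model only: hypothetical names, no Lucene classes involved.
class RunSkipSketch {

    // Drops every required doc inside [runStart, runEnd) without
    // checking each doc individually, as a run-skipping API would allow.
    public static List<Integer> excludeRun(int[] required, int runStart, int runEnd) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : required) {
            if (doc < runStart || doc >= runEnd) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Reference behaviour: drops a doc only when the exact match check confirms it.
    public static List<Integer> excludeExact(int[] required, IntPredicate excludedMatches) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : required) {
            if (!excludedMatches.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] required = {3, 10, 12, 15, 40};

        // Assumption for the example: the exclusion truly matches docs 10..13,
        // but its approximation claims the wider run 10..19.
        IntPredicate exact = doc -> doc >= 10 && doc < 14;

        System.out.println(excludeExact(required, exact)); // [3, 15, 40]
        System.out.println(excludeRun(required, 10, 14));  // [3, 15, 40], exact run is safe
        System.out.println(excludeRun(required, 10, 20));  // [3, 40], doc 15 is wrongly dropped
    }
}
```

With the exact run the result matches the per-document check, while the approximated run silently loses doc 15. That is why I think implementations would need to report an exact range.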