msokolov commented on issue #13147: URL: https://github.com/apache/lucene/issues/13147#issuecomment-1975201779
I ran luceneutil over wikimediumall. The index size was slightly reduced:

```
65200     ../indices/baseline/facets
18923720  ../indices/baseline/index
18988924  ../indices/baseline
65204     ../indices/candidate/facets
18774956  ../indices/candidate/index
18840164  ../indices/candidate
```

In a microbenchmark where I indexed random doc-only postings I saw ~28% index size reduction. Query performance does seem to have registered some actual change:

```
Task                          QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                 p-value
OrHighNotLow                        124.19   (6.0%)                   111.98   (6.7%)     -9.8%  (-21% -   3%)    0.000
LowSpanNear                           1.50   (1.1%)                     1.42   (1.2%)     -4.8%  ( -7% -  -2%)    0.000
HighTermTitleSort                    86.63   (3.0%)                    82.70   (2.2%)     -4.5%  ( -9% -   0%)    0.000
MedIntervalsOrdered                   3.25   (4.3%)                     3.11   (4.3%)     -4.1%  (-12% -   4%)    0.003
OrHighHigh                           23.47   (6.7%)                    22.61   (3.3%)     -3.7%  (-12% -   6%)    0.029
LowIntervalsOrdered                   4.20   (4.1%)                     4.05   (4.1%)     -3.5%  (-11% -   4%)    0.007
AndHighHigh                          25.46   (8.5%)                    24.57   (4.9%)     -3.5%  (-15% -  10%)    0.114
BrowseRandomLabelTaxoFacets           2.05  (14.8%)                     1.98  (11.0%)     -3.4%  (-25% -  26%)    0.405
HighIntervalsOrdered                  2.09   (5.3%)                     2.02   (5.4%)     -3.1%  (-13% -   7%)    0.063
HighSpanNear                          4.25   (1.9%)                     4.13   (2.0%)     -2.8%  ( -6% -   1%)    0.000
OrHighMed                            43.34   (3.1%)                    42.18   (2.1%)     -2.7%  ( -7% -   2%)    0.001
BrowseDateTaxoFacets                  2.78   (7.6%)                     2.70   (6.6%)     -2.7%  (-15% -  12%)    0.234
BrowseDayOfYearTaxoFacets             2.81   (7.2%)                     2.74   (6.2%)     -2.5%  (-14% -  11%)    0.236
Prefix3                             126.88   (2.3%)                   123.78   (3.5%)     -2.4%  ( -8% -   3%)    0.009
MedSpanNear                          11.93   (0.9%)                    11.65   (1.1%)     -2.3%  ( -4% -   0%)    0.000
OrHighNotMed                        141.45   (5.1%)                   138.33   (7.0%)     -2.2%  (-13% -  10%)    0.254
AndHighMed                           36.62   (5.6%)                    35.82   (3.1%)     -2.2%  (-10% -   6%)    0.124
MedPhrase                            67.69   (2.9%)                    66.22   (2.6%)     -2.2%  ( -7% -   3%)    0.013
HighSloppyPhrase                     10.38   (1.6%)                    10.20   (1.5%)     -1.8%  ( -4% -   1%)    0.000
IntNRQ                                8.57  (14.4%)                     8.42  (16.1%)     -1.8%  (-28% -  33%)    0.713
HighTerm                            271.19   (4.0%)                   266.87   (5.1%)     -1.6%  (-10% -   7%)    0.271
MedSloppyPhrase                       8.12   (1.9%)                     8.00   (2.5%)     -1.6%  ( -5% -   2%)    0.028
HighPhrase                           39.43   (3.8%)                    38.94   (3.1%)     -1.2%  ( -7% -   5%)    0.257
MedTerm                             235.50   (3.4%)                   232.58   (4.7%)     -1.2%  ( -9% -   7%)    0.339
LowPhrase                            46.81   (2.8%)                    46.27   (2.3%)     -1.2%  ( -6% -   4%)    0.157
OrHighNotHigh                       147.42   (4.7%)                   145.78   (6.2%)     -1.1%  (-11% -  10%)    0.525
TermDTSort                           88.33   (2.8%)                    87.38   (1.8%)     -1.1%  ( -5% -   3%)    0.151
HighTermDayOfYearSort               152.37   (2.1%)                   150.79   (1.8%)     -1.0%  ( -4% -   2%)    0.093
LowTerm                             254.01   (1.9%)                   251.72   (2.6%)     -0.9%  ( -5% -   3%)    0.207
LowSloppyPhrase                      24.52   (0.9%)                    24.32   (1.4%)     -0.8%  ( -3% -   1%)    0.029
OrNotHighHigh                       199.37   (3.8%)                   197.74   (4.9%)     -0.8%  ( -9% -   8%)    0.557
HighTermMonthSort                  1581.75   (2.6%)                  1569.14   (2.1%)     -0.8%  ( -5% -   4%)    0.292
OrNotHighMed                        134.43   (2.7%)                   133.51   (3.3%)     -0.7%  ( -6% -   5%)    0.471
OrHighLow                           279.41   (2.1%)                   277.84   (2.2%)     -0.6%  ( -4% -   3%)    0.412
Fuzzy1                               64.73   (1.5%)                    64.48   (0.7%)     -0.4%  ( -2% -   1%)    0.302
OrHighMedDayTaxoFacets                3.84   (6.3%)                     3.83   (5.4%)     -0.4%  (-11% -  12%)    0.845
AndHighMedDayTaxoFacets              31.84   (1.2%)                    31.74   (1.5%)     -0.3%  ( -2% -   2%)    0.444
Fuzzy2                               36.90   (1.3%)                    36.80   (0.8%)     -0.3%  ( -2% -   1%)    0.383
BrowseRandomLabelSSDVFacets           1.57   (5.5%)                     1.57   (3.8%)     -0.2%  ( -9% -   9%)    0.906
PKLookup                            140.43   (1.7%)                   140.30   (2.1%)     -0.1%  ( -3% -   3%)    0.876
AndHighLow                          279.44   (2.2%)                   279.34   (2.3%)     -0.0%  ( -4% -   4%)    0.958
OrNotHighLow                        345.34   (1.7%)                   345.21   (1.9%)     -0.0%  ( -3% -   3%)    0.948
Respell                              33.36   (1.5%)                    33.38   (1.3%)      0.1%  ( -2% -   2%)    0.881
MedTermDayTaxoFacets                 10.12   (2.4%)                    10.13   (2.4%)      0.1%  ( -4% -   4%)    0.912
BrowseDayOfYearSSDVFacets             2.32   (5.4%)                     2.33   (3.3%)      0.1%  ( -8% -   9%)    0.953
HighTermTitleBDVSort                  4.74   (3.3%)                     4.74   (4.0%)      0.1%  ( -6% -   7%)    0.902
Wildcard                            136.61   (2.5%)                   136.82   (2.2%)      0.2%  ( -4% -   4%)    0.831
BrowseDateSSDVFacets                  0.68  (13.1%)                     0.68  (13.0%)      0.4%  (-22% -  30%)    0.928
BrowseMonthTaxoFacets                 2.84   (3.8%)                     2.87   (1.3%)      1.1%  ( -3% -   6%)    0.207
BrowseMonthSSDVFacets                 2.38   (5.1%)                     2.41   (4.0%)      1.3%  ( -7% -  11%)    0.362
AndHighHighDayTaxoFacets              3.23   (3.6%)                     3.34   (2.9%)      3.5%  ( -2% -  10%)    0.001
```

So this looks positive.
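For anyone who wants to slice these numbers further, here is a minimal Python sketch (not part of luceneutil) that computes the relative change in total index size from the `du` figures and picks out rows whose reported p-value falls below a threshold. The helper names (`pct_change`, `parse_rows`, `significant`), the regular expression, and the 0.05 cutoff are all assumptions made for illustration, and only a few rows from the table are included as sample input.

```python
import re

# `du` totals copied from the comment above (units are whatever `du` reported).
BASELINE_TOTAL = 18988924
CANDIDATE_TOTAL = 18840164

# A handful of rows copied from the luceneutil table above, not the full table.
SAMPLE_ROWS = """\
OrHighNotLow 124.19 (6.0%) 111.98 (6.7%) -9.8% ( -21% - 3%) 0.000
AndHighHigh 25.46 (8.5%) 24.57 (4.9%) -3.5% ( -15% - 10%) 0.114
PKLookup 140.43 (1.7%) 140.30 (2.1%) -0.1% ( -3% - 3%) 0.876
AndHighHighDayTaxoFacets 3.23 (3.6%) 3.34 (2.9%) 3.5% ( -2% - 10%) 0.001
"""

# Hypothetical pattern for one row: task, baseline QPS (stddev),
# candidate QPS (stddev), pct diff, (range), p-value.
ROW_RE = re.compile(
    r"^(?P<task>\S+)\s+"
    r"(?P<base>[\d.]+)\s+\([\d.]+%\)\s+"
    r"(?P<cand>[\d.]+)\s+\([\d.]+%\)\s+"
    r"(?P<pct>-?[\d.]+)%\s+\(.*\)\s+"
    r"(?P<p>[\d.]+)$"
)


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def parse_rows(text):
    """Parse benchmark rows into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows.append({
                "task": m.group("task"),
                "base_qps": float(m.group("base")),
                "cand_qps": float(m.group("cand")),
                "pct": float(m.group("pct")),
                "p": float(m.group("p")),
            })
    return rows


def significant(rows, alpha=0.05):
    """Rows whose reported p-value is below `alpha` (an assumed cutoff)."""
    return [r for r in rows if r["p"] < alpha]


if __name__ == "__main__":
    print(f"total index size change: {pct_change(BASELINE_TOTAL, CANDIDATE_TOTAL):.2f}%")
    for row in significant(parse_rows(SAMPLE_ROWS)):
        print(row["task"], f"{row['pct']:+.1f}%", f"p={row['p']:.3f}")
```

Note that the p-values in the table are rounded to three decimals, so a row printed as 0.000 only means the value is below 0.0005.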
I can try tuning the decision parameter controlling which encoding to use to see what impact that may have. I guess what I wonder is whether the added complexity is worth chasing this, but I'm pretty encouraged that the overhead of the conditionals isn't overwhelming the "within-block skipping" this affords.