I have run the luceneutil benchmark with a higher iteration count and repeat count, but the results are still very noisy, which I attribute to running the benchmarks on a laptop.
The results consistently show speedups for some of the facets tasks and small slowdowns for others. One task that clearly shows a slowdown is HighTermTitleBDVSort, which I expected because we now read those bytes on heap using a BytesRefBuilder. The only way to prevent this slowdown would be to make the off-heap BytesRef thingy able to implement the Comparable interface efficiently or, alternatively, to expose the doc values max length so implementations can do the same as we are doing today.

Task                         QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff  p-value
HighTermTitleBDVSort            4.30   (2.8%)     4.19   (2.0%)  -2.5%  ( -7% -   2%)  0.000
BrowseMonthTaxoFacets          10.25  (26.1%)    10.06  (26.4%)  -1.8%  (-43% -  68%)  0.623
TermDTSort                    216.72   (7.3%)   213.19   (6.7%)  -1.6%  (-14% -  13%)  0.100
OrNotHighMed                  464.01   (3.5%)   459.77   (3.5%)  -0.9%  ( -7% -   6%)  0.066
Wildcard                      305.50   (4.1%)   303.08   (3.9%)  -0.8%  ( -8% -   7%)  0.164
Respell                        60.92   (2.5%)    60.51   (2.4%)  -0.7%  ( -5% -   4%)  0.055
AndHighLow                   1406.33   (3.1%)  1397.09   (3.3%)  -0.7%  ( -6% -   5%)  0.151
OrHighNotHigh                 435.53   (4.8%)   432.71   (5.6%)  -0.6%  (-10% -  10%)  0.383
LowTerm                       931.83   (3.5%)   925.97   (3.2%)  -0.6%  ( -7% -   6%)  0.185
HighTermDayOfYearSort         410.08   (4.9%)   407.77   (4.9%)  -0.6%  ( -9% -   9%)  0.416
AndHighHighDayTaxoFacets        6.45   (2.1%)     6.42   (2.4%)  -0.6%  ( -4% -   3%)  0.081
MedPhrase                     117.99   (2.7%)   117.34   (2.7%)  -0.6%  ( -5% -   4%)  0.142
LowPhrase                      24.97   (2.9%)    24.84   (3.1%)  -0.5%  ( -6% -   5%)  0.210
AndHighMedDayTaxoFacets        18.40   (2.0%)    18.31   (2.3%)  -0.5%  ( -4% -   3%)  0.095
HighTerm                      793.15   (4.4%)   789.25   (5.2%)  -0.5%  ( -9% -   9%)  0.469
OrHighMedDayTaxoFacets          4.64   (3.3%)     4.62   (3.7%)  -0.5%  ( -7% -   6%)  0.334
OrHighMed                     188.68   (2.6%)   187.79   (3.1%)  -0.5%  ( -6% -   5%)  0.244
OrNotHighLow                 1277.67   (2.5%)  1271.99   (2.9%)  -0.4%  ( -5% -   5%)  0.245
HighPhrase                     59.28   (3.6%)    59.03   (3.4%)  -0.4%  ( -7% -   6%)  0.402
Fuzzy2                         82.34   (2.2%)    82.02   (2.2%)  -0.4%  ( -4% -   4%)  0.205
OrNotHighHigh                 688.74   (3.8%)   686.08   (4.4%)  -0.4%  ( -8% -   8%)  0.507
MedTermDayTaxoFacets           10.69   (3.2%)    10.65   (2.9%)  -0.4%  ( -6% -   5%)  0.382
OrHighNotMed                  722.62   (4.8%)   720.03   (5.7%)  -0.4%  (-10% -  10%)  0.630
MedTerm                       979.84   (3.5%)   976.55   (4.0%)  -0.3%  ( -7% -   7%)  0.528
OrHighLow                     780.09   (2.7%)   777.60   (2.8%)  -0.3%  ( -5% -   5%)  0.413
OrHighNotLow                  639.07   (5.1%)   637.34   (6.0%)  -0.3%  (-10% -  11%)  0.728
Prefix3                       371.35   (2.3%)   370.43   (1.9%)  -0.2%  ( -4% -   4%)  0.409
AndHighMed                    127.87   (2.4%)   127.63   (2.8%)  -0.2%  ( -5% -   5%)  0.613
HighIntervalsOrdered            4.88   (5.7%)     4.88   (5.8%)  -0.1%  (-11% -  12%)  0.882
HighTermMonthSort            2844.77   (2.7%)  2841.57   (2.8%)  -0.1%  ( -5% -   5%)  0.770
OrHighHigh                     56.75   (1.8%)    56.70   (2.1%)  -0.1%  ( -3% -   3%)  0.793
MedIntervalsOrdered             7.42   (3.7%)     7.41   (3.8%)  -0.1%  ( -7% -   7%)  0.894
LowIntervalsOrdered            22.00   (3.6%)    21.99   (3.6%)  -0.0%  ( -7% -   7%)  0.930
MedSloppyPhrase                34.10   (2.3%)    34.11   (2.3%)   0.0%  ( -4% -   4%)  0.931
range                        7621.67   (4.4%)  7625.28   (3.8%)   0.0%  ( -7% -   8%)  0.935
HighTermTitleSort             208.38   (2.7%)   208.52   (2.6%)   0.1%  ( -5% -   5%)  0.858
LowSpanNear                     9.95   (1.4%)     9.96   (1.5%)   0.1%  ( -2% -   3%)  0.668
Fuzzy1                         70.10   (2.4%)    70.23   (2.6%)   0.2%  ( -4% -   5%)  0.616
PKLookup                      262.66   (3.9%)   263.16   (4.3%)   0.2%  ( -7% -   8%)  0.743
HighSloppyPhrase                7.09   (3.7%)     7.11   (4.2%)   0.2%  ( -7% -   8%)  0.732
AndHighHigh                    49.37   (2.2%)    49.47   (2.5%)   0.2%  ( -4% -   4%)  0.538
LowSloppyPhrase               186.55   (6.7%)   186.96   (6.7%)   0.2%  (-12% -  14%)  0.813
MedSpanNear                    33.62   (2.6%)    33.71   (2.8%)   0.2%  ( -5% -   5%)  0.523
IntNRQ                         25.87   (5.7%)    25.94   (5.0%)   0.3%  ( -9% -  11%)  0.713
HighSpanNear                   11.58   (3.1%)    11.61   (3.2%)   0.3%  ( -5% -   6%)  0.528
BrowseDayOfYearTaxoFacets       6.98   (8.0%)     7.08  (12.6%)   1.4%  (-17% -  23%)  0.333
BrowseDateTaxoFacets            6.90   (8.2%)     7.01  (12.8%)   1.5%  (-17% -  24%)  0.321
BrowseRandomLabelSSDVFacets     5.04   (8.6%)     5.15  (11.5%)   2.1%  (-16% -  24%)  0.148
BrowseDateSSDVFacets            1.37  (15.9%)     1.40  (16.0%)   2.1%  (-25% -  40%)  0.341
BrowseDayOfYearSSDVFacets       6.24  (10.4%)     6.39  (12.9%)   2.4%  (-18% -  28%)  0.148
BrowseRandomLabelTaxoFacets     5.65   (8.3%)     5.80  (22.2%)   2.5%  (-25% -  35%)  0.286
BrowseMonthSSDVFacets           6.24  (10.7%)     6.40  (13.4%)   2.6%  (-19% -  29%)  0.136

On Sat, Dec 7, 2024 at 9:08 PM Adrien Grand <jpou...@gmail.com> wrote:
>
> FWIW I have also seen some users store sparse vectors or bloom filters in
> binary doc values. In both cases, the serialized size may be non negligible
> while not all bytes are needed. This change would likely help.
>
> Having the binary sort and faceting tasks not show a big slowdown would be
> good as these should be the worst case scenario for this change, as all bytes
> need to be read?
>
> @Ignacio Your luceneutil results show a couple significant speedups and small
> slowdowns but the p-values are high, which suggests that results are very
> noisy. I wonder if the benchmark had enough iterations or taskRepeatCount.
>
> On Thu, Dec 5, 2024 at 17:38, Ignacio Vera <iver...@gmail.com> wrote:
>>
>> @Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.
>>
>> @Mike: Yes, there are other use cases. One that is close to my heart
>> is the geo use case where in many cases you don't need to read all the
>> bytes, and geometries can be big. In lucene there are some interesting
>> usages in the facets module which I already implemented in the PR.
>> Running the wikimedium benchmark on it (I think) it shows an
>> improvement on the facets runs as well as some regressions:
>>
>> BrowseRandomLabelSSDVFacets     5.21  (12.5%)     5.01   (8.8%)  -3.8%  (-22% -  20%)  0.264
>> OrHighMedDayTaxoFacets          4.78   (5.5%)     4.68   (5.0%)  -2.1%  (-11% -   8%)  0.211
>> HighTermTitleBDVSort            9.11   (1.5%)     8.96   (1.4%)  -1.7%  ( -4% -   1%)  0.000
>> BrowseDayOfYearSSDVFacets       6.64  (15.7%)     6.54  (15.1%)  -1.5%  (-27% -  34%)  0.765
>> BrowseMonthSSDVFacets           6.50  (11.8%)     6.42  (11.6%)  -1.3%  (-22% -  25%)  0.729
>> HighTerm                      565.97   (6.5%)   562.79   (7.0%)  -0.6%  (-13% -  13%)  0.793
>> HighTermMonthSort            2350.90   (4.5%)  2338.02   (4.0%)  -0.5%  ( -8% -   8%)  0.684
>> AndHighHigh                    55.14   (3.3%)    54.97   (4.1%)  -0.3%  ( -7% -   7%)  0.803
>> OrHighNotMed                  406.21   (7.2%)   405.14   (6.6%)  -0.3%  (-13% -  14%)  0.904
>> OrNotHighHigh                 425.43   (3.5%)   424.78   (3.3%)  -0.2%  ( -6% -   6%)  0.886
>> MedTermDayTaxoFacets           30.90   (2.0%)    30.86   (1.6%)  -0.1%  ( -3% -   3%)  0.834
>> MedSloppyPhrase                32.77   (3.7%)    32.73   (3.7%)  -0.1%  ( -7% -   7%)  0.921
>> AndHighMed                    184.44   (3.4%)   184.37   (3.6%)  -0.0%  ( -6% -   7%)  0.969
>> LowSloppyPhrase                52.47   (1.7%)    52.47   (1.6%)   0.0%  ( -3% -   3%)  0.996
>> OrHighNotHigh                 619.12   (5.0%)   619.46   (4.6%)   0.1%  ( -9% -  10%)  0.971
>> OrHighNotLow                  567.45   (6.5%)   568.21   (6.2%)   0.1%  (-11% -  13%)  0.947
>> PKLookup                      275.72   (2.3%)   276.13   (3.3%)   0.1%  ( -5% -   5%)  0.872
>> LowIntervalsOrdered             6.15   (2.1%)     6.16   (2.5%)   0.1%  ( -4% -   4%)  0.836
>> IntNRQ                         76.94   (5.8%)    77.11   (4.5%)   0.2%  ( -9% -  11%)  0.895
>> HighSloppyPhrase                2.32   (2.7%)     2.33   (2.3%)   0.3%  ( -4% -   5%)  0.685
>> LowTerm                       629.84   (3.5%)   632.74   (3.3%)   0.5%  ( -6% -   7%)  0.670
>> LowSpanNear                    99.79   (2.7%)   100.30   (3.2%)   0.5%  ( -5% -   6%)  0.589
>> MedTerm                       889.05   (4.2%)   893.75   (4.6%)   0.5%  ( -7% -   9%)  0.703
>> OrNotHighMed                  361.55   (3.2%)   363.50   (3.1%)   0.5%  ( -5% -   7%)  0.591
>> Prefix3                       134.42   (4.3%)   135.18   (3.8%)   0.6%  ( -7% -   8%)  0.656
>> HighTermTitleSort             188.27   (2.1%)   189.35   (2.5%)   0.6%  ( -3% -   5%)  0.423
>> HighIntervalsOrdered            7.97   (4.8%)     8.02   (6.0%)   0.6%  ( -9% -  11%)  0.736
>> Wildcard                       57.19   (2.9%)    57.53   (3.1%)   0.6%  ( -5% -   6%)  0.525
>> OrHighLow                     551.58   (2.9%)   555.28   (2.5%)   0.7%  ( -4% -   6%)  0.436
>> MedIntervalsOrdered            29.22   (4.9%)    29.41   (6.1%)   0.7%  ( -9% -  12%)  0.697
>> MedPhrase                      30.11   (2.1%)    30.32   (1.5%)   0.7%  ( -2% -   4%)  0.241
>> OrHighHigh                     54.77   (6.7%)    55.15   (5.2%)   0.7%  (-10% -  13%)  0.714
>> Fuzzy1                        108.14   (2.8%)   108.90   (2.4%)   0.7%  ( -4% -   6%)  0.403
>> OrHighMed                     182.51   (5.4%)   183.80   (3.4%)   0.7%  ( -7% -  10%)  0.622
>> AndHighMedDayTaxoFacets        30.18   (3.1%)    30.40   (2.3%)   0.7%  ( -4% -   6%)  0.403
>> HighTermDayOfYearSort         462.68   (3.7%)   466.03   (3.6%)   0.7%  ( -6% -   8%)  0.532
>> AndHighLow                   1225.05   (5.2%)  1233.95   (4.5%)   0.7%  ( -8% -  10%)  0.636
>> MedSpanNear                    13.85   (2.2%)    13.95   (2.0%)   0.7%  ( -3% -   5%)  0.264
>> LowPhrase                     204.19   (2.6%)   205.88   (1.9%)   0.8%  ( -3% -   5%)  0.247
>> HighPhrase                    105.85   (3.1%)   106.80   (2.6%)   0.9%  ( -4% -   6%)  0.322
>> Fuzzy2                         22.92   (2.6%)    23.13   (2.1%)   0.9%  ( -3% -   5%)  0.233
>> TermDTSort                    295.84   (7.3%)   298.66   (6.6%)   1.0%  (-12% -  16%)  0.665
>> Respell                        78.37   (2.3%)    79.15   (1.8%)   1.0%  ( -2% -   5%)  0.125
>> AndHighHighDayTaxoFacets        2.70   (4.8%)     2.72   (2.5%)   1.0%  ( -6% -   8%)  0.407
>> OrNotHighLow                 1134.11   (3.2%)  1146.96   (3.8%)   1.1%  ( -5% -   8%)  0.310
>> HighSpanNear                    3.88   (7.1%)     3.95   (4.9%)   1.7%  ( -9% -  14%)  0.376
>> range                        5910.33   (9.7%)  6049.55   (8.0%)   2.4%  (-14% -  22%)  0.403
>> BrowseDateSSDVFacets            1.19  (14.3%)     1.24  (19.0%)   4.1%  (-25% -  43%)  0.446
>> BrowseDateTaxoFacets            6.67   (4.6%)     7.08  (24.2%)   6.1%  (-21% -  36%)  0.264
>> BrowseDayOfYearTaxoFacets       6.74   (4.9%)     7.17  (23.8%)   6.4%  (-21% -  36%)  0.237
>> BrowseRandomLabelTaxoFacets     5.39   (3.7%)     6.02  (52.8%)  11.7%  (-43% -  70%)  0.322
>> BrowseMonthTaxoFacets           8.20  (35.8%)     9.48  (37.2%)  15.6%  (-42% - 138%)  0.177
>>
>> On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >
>> > That makes sense to me too in the abstract. At Amazon we also have
>> > interesting BDV fields we have to decode on the fly, so this looks
>> > attractive for that reason (not just faceting).
>> >
>> > I would say though that it would be easier to evaluate the fitness for
>> > purpose (faceting) if we had some examples of BinaryDocValues used for
>> > faceting (or otherwise being decoded on the fly) in the Lucene code
>> > base -- do we have that? I'd be concerned if we're not able to fully
>> > test the new functionality to see what the impact of any changes might
>> > be.
>> >
>> > On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
>> > <christopher.hega...@elastic.co.invalid> wrote:
>> > >
>> > > Hi Ignacio,
>> > >
>> > > I completely agree with the idea of having a BytesRef-like thing that
>> > > can be off-heap. For a while now I’ve been thinking about how we could
>> > > evolve BytesRef so as to not expose its on-heap representation. Having a
>> > > separate primitive is probably a better way to go.
>> > >
>> > > -Chris.
>> > >
>> > > > On 5 Dec 2024, at 10:42, Ignacio Vera <iver...@gmail.com> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I have been working with the idea of reading binary doc values
>> > > > off-heap for a while. The idea behind it is that binary doc values are
>> > > > often used for faceting where structured data is encoded at write time
>> > > > and decoded at read time. It feels wasteful to have to read the data
>> > > > on-heap before decoding it when we can read the data directly from the
>> > > > off-heap buffer.
>> > > >
>> > > > The current proposal is to evolve the current API from an on-heap data
>> > > > structure (BytesRef) to an off-heap data structure (currently named
>> > > > RandomAccessInputRef). Because we are currently reading the data into
>> > > > the buffer using a RandomAccessInput with an offset and a length, it
>> > > > feels very natural to create an off-heap equivalent to BytesRef that
>> > > > is backed by a RandomAccessInput.
>> > > >
>> > > > I am hoping to move this idea forward so I am asking for feedback as
>> > > > this is a change on a public API so I would love to hear other
>> > > > opinions.
>> > > >
>> > > > Thank you!
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
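
To make the Comparable point from the top of this thread concrete, here is a minimal sketch of what comparing two off-heap slices without copying them on heap could look like. Everything in it is an assumption for illustration: `OffHeapRef` and `PositionalInput` are hypothetical names and are not taken from the PR. `PositionalInput` only imitates the positional `readByte(long)` read that `RandomAccessInput` offers, and a direct `ByteBuffer` stands in for the off-heap data. The ordering is unsigned lexicographic, which is the order BytesRef uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the positional byte reads of a RandomAccessInput. */
interface PositionalInput {
  byte readByte(long pos);
}

/** Hypothetical off-heap slice: an input plus an offset and a length, with no byte[] copy. */
final class OffHeapRef implements Comparable<OffHeapRef> {
  final PositionalInput in;
  final long offset;
  final int length;

  OffHeapRef(PositionalInput in, long offset, int length) {
    this.in = in;
    this.offset = offset;
    this.length = length;
  }

  /** Unsigned lexicographic comparison, reading one byte at a time from each input. */
  @Override
  public int compareTo(OffHeapRef other) {
    int limit = Math.min(length, other.length);
    for (int i = 0; i < limit; i++) {
      int a = in.readByte(offset + i) & 0xFF;
      int b = other.in.readByte(other.offset + i) & 0xFF;
      if (a != b) {
        return a - b;
      }
    }
    // All shared bytes are equal, so the shorter slice sorts first.
    return length - other.length;
  }
}

final class OffHeapRefDemo {
  /** Copies the bytes into a direct (off-heap) buffer and exposes positional reads over it. */
  static PositionalInput directInput(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
    buf.put(bytes);
    // Absolute get: does not depend on the buffer's current position.
    return pos -> buf.get((int) pos);
  }

  public static void main(String[] args) {
    byte[] data = "applebananaapple".getBytes(StandardCharsets.US_ASCII);
    PositionalInput in = directInput(data);
    OffHeapRef apple = new OffHeapRef(in, 0, 5);
    OffHeapRef banana = new OffHeapRef(in, 5, 6);
    OffHeapRef appleAgain = new OffHeapRef(in, 11, 5);
    System.out.println("apple < banana: " + (apple.compareTo(banana) < 0));
    System.out.println("apple == apple: " + (apple.compareTo(appleAgain) == 0));
  }
}
```

Whether byte-at-a-time positional reads are fast enough for a sort like HighTermTitleBDVSort is the open question here; the sketch only shows the shape of such a comparison and says nothing about its performance.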