FWIW I have also seen some users store sparse vectors or Bloom filters in binary doc values. In both cases, the serialized size may be non-negligible even though not all bytes are needed. This change would likely help.
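
To make the Bloom filter case concrete, here is a minimal sketch of a membership test that reads only one byte per probe from a serialized filter instead of copying the whole value on-heap first. Everything in it is an assumption for illustration: `PositionalBytes` is a hypothetical stand-in for a positional reader such as Lucene's RandomAccessInput, `CountingBytes` exists only to count reads, and the hash in `bitIndex` is not the scheme of any particular library.

```java
import java.nio.ByteBuffer;

final class SparseBloomProbe {

  /** Hypothetical stand-in for a positional reader such as Lucene's RandomAccessInput. */
  interface PositionalBytes {
    byte readByte(long pos);

    long length();
  }

  /** Wraps a ByteBuffer (possibly direct, i.e. off-heap) and counts single-byte reads. */
  static final class CountingBytes implements PositionalBytes {
    private final ByteBuffer buffer;
    long reads = 0;

    CountingBytes(ByteBuffer buffer) {
      this.buffer = buffer;
    }

    @Override
    public byte readByte(long pos) {
      reads++;
      return buffer.get((int) pos); // absolute read, no copy of the whole filter
    }

    @Override
    public long length() {
      return buffer.capacity();
    }
  }

  /** Illustrative hash for the given probe; an assumption, not any library's scheme. */
  static long bitIndex(int hash, int probe, long numBits) {
    long h = (hash & 0xFFFFFFFFL) * 0x9E3779B97F4A7C15L + probe * 0xC2B2AE3D27D4EB4FL;
    h ^= (h >>> 32);
    return Long.remainderUnsigned(h, numBits);
  }

  /** Write side: sets one bit per probe in the serialized filter. */
  static void add(ByteBuffer filter, int hash, int probes) {
    long numBits = (long) filter.capacity() * 8;
    for (int i = 0; i < probes; i++) {
      long bit = bitIndex(hash, i, numBits);
      int byteIndex = (int) (bit >>> 3);
      filter.put(byteIndex, (byte) (filter.get(byteIndex) | (1 << (bit & 7))));
    }
  }

  /** Read side: touches at most {@code probes} bytes, however large the filter is. */
  static boolean mightContain(PositionalBytes filter, int hash, int probes) {
    long numBits = filter.length() * 8;
    for (int i = 0; i < probes; i++) {
      long bit = bitIndex(hash, i, numBits);
      if ((filter.readByte(bit >>> 3) & (1 << (bit & 7))) == 0) {
        return false; // one unset bit is enough to rule the key out
      }
    }
    return true;
  }
}
```

With a 1 KB filter and 3 probes, `mightContain` reads at most 3 bytes, whereas an on-heap BytesRef would require copying all 1024 bytes before the first bit can be tested.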
It would be good if the binary sort and faceting tasks did not show a big slowdown, as these should be the worst-case scenario for this change, since all bytes need to be read?

@Ignacio Your luceneutil results show a couple of significant speedups and small slowdowns, but the p-values are high, which suggests that the results are very noisy. I wonder if the benchmark had enough iterations or a high enough taskRepeatCount.

On Thu, Dec 5, 2024 at 17:38, Ignacio Vera <iver...@gmail.com> wrote:

> @Chris: Agreeing on an off-heap BytesRef thingy would be a great step
> forward.
>
> @Mike: Yes, there are other use cases. One that is close to my heart
> is the geo use case where in many cases you don't need to read all the
> bytes, and geometries can be big. In Lucene there are some interesting
> usages in the facets module which I already implemented in the PR.
> Running the wikimedium benchmark on it (I think) it shows an
> improvement on the facets runs as well as some regressions:
>
> BrowseRandomLabelSSDVFacets     5.21 (12.5%)     5.01  (8.8%)   -3.8% (-22% -  20%) 0.264
> OrHighMedDayTaxoFacets          4.78  (5.5%)     4.68  (5.0%)   -2.1% (-11% -   8%) 0.211
> HighTermTitleBDVSort            9.11  (1.5%)     8.96  (1.4%)   -1.7% ( -4% -   1%) 0.000
> BrowseDayOfYearSSDVFacets       6.64 (15.7%)     6.54 (15.1%)   -1.5% (-27% -  34%) 0.765
> BrowseMonthSSDVFacets           6.50 (11.8%)     6.42 (11.6%)   -1.3% (-22% -  25%) 0.729
> HighTerm                      565.97  (6.5%)   562.79  (7.0%)   -0.6% (-13% -  13%) 0.793
> HighTermMonthSort            2350.90  (4.5%)  2338.02  (4.0%)   -0.5% ( -8% -   8%) 0.684
> AndHighHigh                    55.14  (3.3%)    54.97  (4.1%)   -0.3% ( -7% -   7%) 0.803
> OrHighNotMed                  406.21  (7.2%)   405.14  (6.6%)   -0.3% (-13% -  14%) 0.904
> OrNotHighHigh                 425.43  (3.5%)   424.78  (3.3%)   -0.2% ( -6% -   6%) 0.886
> MedTermDayTaxoFacets           30.90  (2.0%)    30.86  (1.6%)   -0.1% ( -3% -   3%) 0.834
> MedSloppyPhrase                32.77  (3.7%)    32.73  (3.7%)   -0.1% ( -7% -   7%) 0.921
> AndHighMed                    184.44  (3.4%)   184.37  (3.6%)   -0.0% ( -6% -   7%) 0.969
> LowSloppyPhrase                52.47  (1.7%)    52.47  (1.6%)    0.0% ( -3% -   3%) 0.996
> OrHighNotHigh                 619.12  (5.0%)   619.46  (4.6%)    0.1% ( -9% -  10%) 0.971
> OrHighNotLow                  567.45  (6.5%)   568.21  (6.2%)    0.1% (-11% -  13%) 0.947
> PKLookup                      275.72  (2.3%)   276.13  (3.3%)    0.1% ( -5% -   5%) 0.872
> LowIntervalsOrdered             6.15  (2.1%)     6.16  (2.5%)    0.1% ( -4% -   4%) 0.836
> IntNRQ                         76.94  (5.8%)    77.11  (4.5%)    0.2% ( -9% -  11%) 0.895
> HighSloppyPhrase                2.32  (2.7%)     2.33  (2.3%)    0.3% ( -4% -   5%) 0.685
> LowTerm                       629.84  (3.5%)   632.74  (3.3%)    0.5% ( -6% -   7%) 0.670
> LowSpanNear                    99.79  (2.7%)   100.30  (3.2%)    0.5% ( -5% -   6%) 0.589
> MedTerm                       889.05  (4.2%)   893.75  (4.6%)    0.5% ( -7% -   9%) 0.703
> OrNotHighMed                  361.55  (3.2%)   363.50  (3.1%)    0.5% ( -5% -   7%) 0.591
> Prefix3                       134.42  (4.3%)   135.18  (3.8%)    0.6% ( -7% -   8%) 0.656
> HighTermTitleSort             188.27  (2.1%)   189.35  (2.5%)    0.6% ( -3% -   5%) 0.423
> HighIntervalsOrdered            7.97  (4.8%)     8.02  (6.0%)    0.6% ( -9% -  11%) 0.736
> Wildcard                       57.19  (2.9%)    57.53  (3.1%)    0.6% ( -5% -   6%) 0.525
> OrHighLow                     551.58  (2.9%)   555.28  (2.5%)    0.7% ( -4% -   6%) 0.436
> MedIntervalsOrdered            29.22  (4.9%)    29.41  (6.1%)    0.7% ( -9% -  12%) 0.697
> MedPhrase                      30.11  (2.1%)    30.32  (1.5%)    0.7% ( -2% -   4%) 0.241
> OrHighHigh                     54.77  (6.7%)    55.15  (5.2%)    0.7% (-10% -  13%) 0.714
> Fuzzy1                        108.14  (2.8%)   108.90  (2.4%)    0.7% ( -4% -   6%) 0.403
> OrHighMed                     182.51  (5.4%)   183.80  (3.4%)    0.7% ( -7% -  10%) 0.622
> AndHighMedDayTaxoFacets        30.18  (3.1%)    30.40  (2.3%)    0.7% ( -4% -   6%) 0.403
> HighTermDayOfYearSort         462.68  (3.7%)   466.03  (3.6%)    0.7% ( -6% -   8%) 0.532
> AndHighLow                   1225.05  (5.2%)  1233.95  (4.5%)    0.7% ( -8% -  10%) 0.636
> MedSpanNear                    13.85  (2.2%)    13.95  (2.0%)    0.7% ( -3% -   5%) 0.264
> LowPhrase                     204.19  (2.6%)   205.88  (1.9%)    0.8% ( -3% -   5%) 0.247
> HighPhrase                    105.85  (3.1%)   106.80  (2.6%)    0.9% ( -4% -   6%) 0.322
> Fuzzy2                         22.92  (2.6%)    23.13  (2.1%)    0.9% ( -3% -   5%) 0.233
> TermDTSort                    295.84  (7.3%)   298.66  (6.6%)    1.0% (-12% -  16%) 0.665
> Respell                        78.37  (2.3%)    79.15  (1.8%)    1.0% ( -2% -   5%) 0.125
> AndHighHighDayTaxoFacets        2.70  (4.8%)     2.72  (2.5%)    1.0% ( -6% -   8%) 0.407
> OrNotHighLow                 1134.11  (3.2%)  1146.96  (3.8%)    1.1% ( -5% -   8%) 0.310
> HighSpanNear                    3.88  (7.1%)     3.95  (4.9%)    1.7% ( -9% -  14%) 0.376
> range                        5910.33  (9.7%)  6049.55  (8.0%)    2.4% (-14% -  22%) 0.403
> BrowseDateSSDVFacets            1.19 (14.3%)     1.24 (19.0%)    4.1% (-25% -  43%) 0.446
> BrowseDateTaxoFacets            6.67  (4.6%)     7.08 (24.2%)    6.1% (-21% -  36%) 0.264
> BrowseDayOfYearTaxoFacets       6.74  (4.9%)     7.17 (23.8%)    6.4% (-21% -  36%) 0.237
> BrowseRandomLabelTaxoFacets     5.39  (3.7%)     6.02 (52.8%)   11.7% (-43% -  70%) 0.322
> BrowseMonthTaxoFacets           8.20 (35.8%)     9.48 (37.2%)   15.6% (-42% - 138%) 0.177
>
> On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > That makes sense to me too in the abstract. At Amazon we also have
> > interesting BDV fields we have to decode on the fly, so this looks
> > attractive for that reason (not just faceting).
> >
> > I would say though that it would be easier to evaluate the fitness for
> > purpose (faceting) if we had some examples of BinaryDocValues used for
> > faceting (or otherwise being decoded on the fly) in the Lucene code
> > base -- do we have that? I'd be concerned if we're not able to fully
> > test the new functionality to see what the impact of any changes might
> > be.
> >
> > On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
> > <christopher.hega...@elastic.co.invalid> wrote:
> > >
> > > Hi Ignacio,
> > >
> > > I completely agree with the idea of having a BytesRef-like thing that
> > > can be off-heap. For a while now I’ve been thinking about how we could
> > > evolve BytesRef so as to not expose its on-heap representation. Having a
> > > separate primitive is probably a better way to go.
> > >
> > > -Chris.
> > >
> > > > On 5 Dec 2024, at 10:42, Ignacio Vera <iver...@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have been working with the idea of reading binary doc values
> > > > off-heap for a while.
> > > > The idea behind it is that binary doc values are
> > > > often used for faceting where structured data is encoded at write time
> > > > and decoded at read time. It feels wasteful to have to read the data
> > > > on-heap before decoding it when we can read the data directly from the
> > > > off-heap buffer.
> > > >
> > > > The current proposal is to evolve the current API from an on-heap data
> > > > structure (BytesRef) to an off-heap data structure (currently named
> > > > RandomAccessInputRef). Because we are currently reading the data into
> > > > the buffer using a RandomAccessInput with an offset and a length, it
> > > > feels very natural to create an off-heap equivalent to BytesRef that
> > > > is backed by a RandomAccessInput.
> > > >
> > > > I am hoping to move this idea forward so I am asking for feedback as
> > > > this is a change on a public API so I would love to hear other
> > > > opinions.
> > > >
> > > > Thank you!
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
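
For readers following the thread, the "off-heap equivalent to BytesRef" described in the original proposal can be pictured roughly as follows. This is only a sketch under stated assumptions, not the API from the PR: `Input` is a hypothetical stand-in for Lucene's RandomAccessInput, `BufferInput` backs it with a ByteBuffer for the sake of a self-contained example, and `Ref` is an illustrative name for a view made of an input, an offset and a length.

```java
import java.nio.ByteBuffer;

final class OffHeapRefSketch {

  /** Hypothetical stand-in for Lucene's RandomAccessInput: absolute positional reads. */
  interface Input {
    byte readByte(long pos);

    int readInt(long pos);
  }

  /** An Input over a ByteBuffer, which may be direct (off-heap) or memory-mapped. */
  static final class BufferInput implements Input {
    private final ByteBuffer buffer;

    BufferInput(ByteBuffer buffer) {
      this.buffer = buffer;
    }

    @Override
    public byte readByte(long pos) {
      return buffer.get((int) pos);
    }

    @Override
    public int readInt(long pos) {
      return buffer.getInt((int) pos); // uses the buffer's byte order (big-endian by default)
    }
  }

  /** Illustrative off-heap analogue of BytesRef: a view (input, offset, length), no copy. */
  static final class Ref {
    final Input input;
    final long offset;
    final int length;

    Ref(Input input, long offset, int length) {
      this.input = input;
      this.offset = offset;
      this.length = length;
    }

    private void check(int index, int width) {
      if (index < 0 || index > length - width) {
        throw new IndexOutOfBoundsException(
            "index=" + index + ", width=" + width + ", length=" + length);
      }
    }

    /** Reads one byte of the value directly from the backing input. */
    byte byteAt(int index) {
      check(index, 1);
      return input.readByte(offset + index);
    }

    /** Decodes an int field of the value in place, without materializing the value. */
    int intAt(int index) {
      check(index, Integer.BYTES);
      return input.readInt(offset + index);
    }

    /** The on-heap path for comparison: copies every byte of the value. */
    byte[] toBytes() {
      byte[] copy = new byte[length];
      for (int i = 0; i < length; i++) {
        copy[i] = input.readByte(offset + i);
      }
      return copy;
    }
  }
}
```

A consumer that needs a single field calls `ref.intAt(4)` and touches four bytes. A consumer that needs everything still pays for all bytes, which is why the binary sort and faceting tasks are the interesting worst case for the benchmark.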