@Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.

@Mike: Yes, there are other use cases. One that is close to my heart is the geo use case, where in many cases you don't need to read all the bytes and geometries can be big. In Lucene there are some interesting usages in the facets module, which I have already implemented in the PR. Running the wikimedium benchmark on it (I think), it shows an improvement on the facets runs as well as some regressions:

Task                         QPS baseline   StdDev  QPS candidate   StdDev  Pct diff                 p-value
BrowseRandomLabelSSDVFacets          5.21  (12.5%)           5.01   (8.8%)     -3.8%  (-22% -  20%)  0.264
OrHighMedDayTaxoFacets               4.78   (5.5%)           4.68   (5.0%)     -2.1%  (-11% -   8%)  0.211
HighTermTitleBDVSort                 9.11   (1.5%)           8.96   (1.4%)     -1.7%  ( -4% -   1%)  0.000
BrowseDayOfYearSSDVFacets            6.64  (15.7%)           6.54  (15.1%)     -1.5%  (-27% -  34%)  0.765
BrowseMonthSSDVFacets                6.50  (11.8%)           6.42  (11.6%)     -1.3%  (-22% -  25%)  0.729
HighTerm                           565.97   (6.5%)         562.79   (7.0%)     -0.6%  (-13% -  13%)  0.793
HighTermMonthSort                 2350.90   (4.5%)        2338.02   (4.0%)     -0.5%  ( -8% -   8%)  0.684
AndHighHigh                         55.14   (3.3%)          54.97   (4.1%)     -0.3%  ( -7% -   7%)  0.803
OrHighNotMed                       406.21   (7.2%)         405.14   (6.6%)     -0.3%  (-13% -  14%)  0.904
OrNotHighHigh                      425.43   (3.5%)         424.78   (3.3%)     -0.2%  ( -6% -   6%)  0.886
MedTermDayTaxoFacets                30.90   (2.0%)          30.86   (1.6%)     -0.1%  ( -3% -   3%)  0.834
MedSloppyPhrase                     32.77   (3.7%)          32.73   (3.7%)     -0.1%  ( -7% -   7%)  0.921
AndHighMed                         184.44   (3.4%)         184.37   (3.6%)     -0.0%  ( -6% -   7%)  0.969
LowSloppyPhrase                     52.47   (1.7%)          52.47   (1.6%)      0.0%  ( -3% -   3%)  0.996
OrHighNotHigh                      619.12   (5.0%)         619.46   (4.6%)      0.1%  ( -9% -  10%)  0.971
OrHighNotLow                       567.45   (6.5%)         568.21   (6.2%)      0.1%  (-11% -  13%)  0.947
PKLookup                           275.72   (2.3%)         276.13   (3.3%)      0.1%  ( -5% -   5%)  0.872
LowIntervalsOrdered                  6.15   (2.1%)           6.16   (2.5%)      0.1%  ( -4% -   4%)  0.836
IntNRQ                              76.94   (5.8%)          77.11   (4.5%)      0.2%  ( -9% -  11%)  0.895
HighSloppyPhrase                     2.32   (2.7%)           2.33   (2.3%)      0.3%  ( -4% -   5%)  0.685
LowTerm                            629.84   (3.5%)         632.74   (3.3%)      0.5%  ( -6% -   7%)  0.670
LowSpanNear                         99.79   (2.7%)         100.30   (3.2%)      0.5%  ( -5% -   6%)  0.589
MedTerm                            889.05   (4.2%)         893.75   (4.6%)      0.5%  ( -7% -   9%)  0.703
OrNotHighMed                       361.55   (3.2%)         363.50   (3.1%)      0.5%  ( -5% -   7%)  0.591
Prefix3                            134.42   (4.3%)         135.18   (3.8%)      0.6%  ( -7% -   8%)  0.656
HighTermTitleSort                  188.27   (2.1%)         189.35   (2.5%)      0.6%  ( -3% -   5%)  0.423
HighIntervalsOrdered                 7.97   (4.8%)           8.02   (6.0%)      0.6%  ( -9% -  11%)  0.736
Wildcard                            57.19   (2.9%)          57.53   (3.1%)      0.6%  ( -5% -   6%)  0.525
OrHighLow                          551.58   (2.9%)         555.28   (2.5%)      0.7%  ( -4% -   6%)  0.436
MedIntervalsOrdered                 29.22   (4.9%)          29.41   (6.1%)      0.7%  ( -9% -  12%)  0.697
MedPhrase                           30.11   (2.1%)          30.32   (1.5%)      0.7%  ( -2% -   4%)  0.241
OrHighHigh                          54.77   (6.7%)          55.15   (5.2%)      0.7%  (-10% -  13%)  0.714
Fuzzy1                             108.14   (2.8%)         108.90   (2.4%)      0.7%  ( -4% -   6%)  0.403
OrHighMed                          182.51   (5.4%)         183.80   (3.4%)      0.7%  ( -7% -  10%)  0.622
AndHighMedDayTaxoFacets             30.18   (3.1%)          30.40   (2.3%)      0.7%  ( -4% -   6%)  0.403
HighTermDayOfYearSort              462.68   (3.7%)         466.03   (3.6%)      0.7%  ( -6% -   8%)  0.532
AndHighLow                        1225.05   (5.2%)        1233.95   (4.5%)      0.7%  ( -8% -  10%)  0.636
MedSpanNear                         13.85   (2.2%)          13.95   (2.0%)      0.7%  ( -3% -   5%)  0.264
LowPhrase                          204.19   (2.6%)         205.88   (1.9%)      0.8%  ( -3% -   5%)  0.247
HighPhrase                         105.85   (3.1%)         106.80   (2.6%)      0.9%  ( -4% -   6%)  0.322
Fuzzy2                              22.92   (2.6%)          23.13   (2.1%)      0.9%  ( -3% -   5%)  0.233
TermDTSort                         295.84   (7.3%)         298.66   (6.6%)      1.0%  (-12% -  16%)  0.665
Respell                             78.37   (2.3%)          79.15   (1.8%)      1.0%  ( -2% -   5%)  0.125
AndHighHighDayTaxoFacets             2.70   (4.8%)           2.72   (2.5%)      1.0%  ( -6% -   8%)  0.407
OrNotHighLow                      1134.11   (3.2%)        1146.96   (3.8%)      1.1%  ( -5% -   8%)  0.310
HighSpanNear                         3.88   (7.1%)           3.95   (4.9%)      1.7%  ( -9% -  14%)  0.376
range                             5910.33   (9.7%)        6049.55   (8.0%)      2.4%  (-14% -  22%)  0.403
BrowseDateSSDVFacets                 1.19  (14.3%)           1.24  (19.0%)      4.1%  (-25% -  43%)  0.446
BrowseDateTaxoFacets                 6.67   (4.6%)           7.08  (24.2%)      6.1%  (-21% -  36%)  0.264
BrowseDayOfYearTaxoFacets            6.74   (4.9%)           7.17  (23.8%)      6.4%  (-21% -  36%)  0.237
BrowseRandomLabelTaxoFacets          5.39   (3.7%)           6.02  (52.8%)     11.7%  (-43% -  70%)  0.322
BrowseMonthTaxoFacets                8.20  (35.8%)           9.48  (37.2%)     15.6%  (-42% - 138%)  0.177

On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> That makes sense to me too in the abstract.
> At Amazon we also have interesting BDV fields we have to decode on the
> fly, so this looks attractive for that reason (not just faceting).
>
> I would say though that it would be easier to evaluate the fitness for
> purpose (faceting) if we had some examples of BinaryDocValues used for
> faceting (or otherwise being decoded on the fly) in the Lucene code
> base -- do we have that? I'd be concerned if we're not able to fully
> test the new functionality to see what the impact of any changes might
> be.
>
> On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
> <christopher.hega...@elastic.co.invalid> wrote:
> >
> > Hi Ignacio,
> >
> > I completely agree with the idea of having a BytesRef-like thing that can
> > be off-heap. For a while now I’ve been thinking about how we could evolve
> > BytesRef so as to not expose its on-heap representation. Having a separate
> > primitive is probably a better way to go.
> >
> > -Chris.
> >
> > > On 5 Dec 2024, at 10:42, Ignacio Vera <iver...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I have been working with the idea of reading binary doc values
> > > off-heap for a while. The idea behind it is that binary doc values are
> > > often used for faceting, where structured data is encoded at write time
> > > and decoded at read time. It feels wasteful to have to read the data
> > > on-heap before decoding it when we can read the data directly from the
> > > off-heap buffer.
> > >
> > > The current proposal is to evolve the current API from an on-heap data
> > > structure (BytesRef) to an off-heap data structure (currently named
> > > RandomAccessInputRef). Because we are currently reading the data into
> > > the buffer using a RandomAccessInput with an offset and a length, it
> > > feels very natural to create an off-heap equivalent to BytesRef that
> > > is backed by a RandomAccessInput.
> > >
> > > I am hoping to move this idea forward, so I am asking for feedback.
> > > This is a change to a public API, so I would love to hear other
> > > opinions.
> > >
> > > Thank you!
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
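
To make the proposal in Ignacio's original message a bit more concrete, here is a minimal Java sketch of a BytesRef-like view that reads from off-heap memory through an offset and a length. It is not the code from the PR and it does not use Lucene's classes: PositionalInput, DirectBufferInput and OffHeapRef are hypothetical names made up for illustration, and a direct ByteBuffer stands in for the off-heap storage.

```java
import java.nio.ByteBuffer;
import java.util.Objects;

/** Hypothetical stand-in for a positional reader (NOT Lucene's RandomAccessInput). */
interface PositionalInput {
  byte readByte(long pos);

  int readInt(long pos);
}

/** Illustration only: serves reads from a direct (off-heap) ByteBuffer. */
final class DirectBufferInput implements PositionalInput {
  private final ByteBuffer buffer;

  private DirectBufferInput(ByteBuffer buffer) {
    this.buffer = buffer;
  }

  /** Copies the given bytes into a freshly allocated direct buffer. */
  static DirectBufferInput copyOf(byte[] data) {
    ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
    direct.put(data);
    return new DirectBufferInput(direct);
  }

  @Override
  public byte readByte(long pos) {
    // Absolute read: does not depend on the buffer's current position.
    return buffer.get(Math.toIntExact(pos));
  }

  @Override
  public int readInt(long pos) {
    // ByteBuffer uses big-endian byte order unless configured otherwise.
    return buffer.getInt(Math.toIntExact(pos));
  }
}

/** Hypothetical off-heap analogue of BytesRef: an (input, offset, length) view. */
final class OffHeapRef {
  private final PositionalInput in;
  private final long offset;
  private final int length;

  OffHeapRef(PositionalInput in, long offset, int length) {
    this.in = Objects.requireNonNull(in);
    this.offset = offset;
    this.length = length;
  }

  int length() {
    return length;
  }

  /** Reads one byte relative to the start of this value, without copying it. */
  byte byteAt(int index) {
    Objects.checkIndex(index, length);
    return in.readByte(offset + index);
  }

  /** Reads a 4-byte int relative to the start of this value. */
  int intAt(int index) {
    Objects.checkFromIndexSize(index, Integer.BYTES, length);
    return in.readInt(offset + index);
  }

  /** Copies the whole value on-heap, for callers that really need a byte[]. */
  byte[] toHeapBytes() {
    byte[] copy = new byte[length];
    for (int i = 0; i < length; i++) {
      copy[i] = in.readByte(offset + i);
    }
    return copy;
  }
}

final class OffHeapRefDemo {
  public static void main(String[] args) {
    // Two bytes of unrelated data, then a 5-byte value: an int header and one payload byte.
    byte[] stored = {9, 9, 0, 0, 0, 42, 7, 9};
    OffHeapRef value = new OffHeapRef(DirectBufferInput.copyOf(stored), 2, 5);
    System.out.println(value.intAt(0)); // prints 42
    System.out.println(value.byteAt(4)); // prints 7
  }
}
```

The point of such a view is that a decoder can read just the part it needs (here a 4-byte header) without first copying the whole value into a byte[], which is the situation described for large geometries and facet decoding in this thread.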