FWIW I have also seen some users store sparse vectors or bloom filters in
binary doc values. In both cases, the serialized size may be non-negligible
while not all bytes are needed. This change would likely help.
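
To make the bloom filter case concrete, here is a minimal sketch, assuming the
filter is serialized as a plain bit array and using a direct ByteBuffer as a
stand-in for the off-heap value. All class and method names are hypothetical;
none of this is Lucene API or code from the PR.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: none of these names come from Lucene or the PR.
final class PartialReadSketch {

    // Bit position of the i-th probe (simple double hashing over numBits bits).
    static int probe(int h1, int h2, int i, int numBits) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Records a key, given its two hashes, in the filter stored at [offset, offset + length).
    static void add(ByteBuffer buf, int offset, int length, int h1, int h2, int k) {
        for (int i = 0; i < k; i++) {
            int bit = probe(h1, h2, i, length * 8);
            int pos = offset + (bit >>> 3);
            buf.put(pos, (byte) (buf.get(pos) | (1 << (bit & 7))));
        }
    }

    // Membership test that reads at most k bytes of the value through absolute gets,
    // without first copying all `length` bytes into an on-heap array.
    static boolean mightContain(ByteBuffer buf, int offset, int length, int h1, int h2, int k) {
        for (int i = 0; i < k; i++) {
            int bit = probe(h1, h2, i, length * 8);
            if ((buf.get(offset + (bit >>> 3)) & (1 << (bit & 7))) == 0) {
                return false; // one unset bit is enough to rule the key out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(4096); // zero-filled, off-heap
        int offset = 128, length = 1024;                  // a 1 KB value inside a larger buffer
        add(buf, offset, length, 17, 31, 3);
        System.out.println(mightContain(buf, offset, length, 17, 31, 3)); // true
        System.out.println(mightContain(buf, offset, length, 18, 31, 3)); // false
    }
}
```

With three probes, the lookup touches at most 3 of the 1024 serialized bytes,
which is the kind of access pattern where skipping the on-heap copy should pay
off.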

It would be good to confirm that the binary sort and faceting tasks do not
show a big slowdown: these should be the worst-case scenario for this
change, as I assume all bytes need to be read there.

@Ignacio Your luceneutil results show a couple of sizable speedups and
small slowdowns, but the p-values are high, which suggests that the results
are very noisy. I wonder if the benchmark had enough iterations or a high
enough taskRepeatCount.

On Thu, Dec 5, 2024 at 17:38, Ignacio Vera <iver...@gmail.com> wrote:

> @Chris: Agreeing on an off-heap BytesRef thingy would be a great step
> forward.
>
> @Mike: Yes, there are other use cases. One that is close to my heart
> is the geo use case where in many cases you don't need to read all the
> bytes, and geometries can be big. In Lucene there are some interesting
> usages in the facets module which I already implemented in the PR.
> Running the wikimedium benchmark on it (I think) shows an
> improvement on the facets runs as well as some regressions:
>
>      BrowseRandomLabelSSDVFacets        5.21     (12.5%)        5.01      (8.8%)   -3.8% ( -22% -   20%) 0.264
>           OrHighMedDayTaxoFacets        4.78      (5.5%)        4.68      (5.0%)   -2.1% ( -11% -    8%) 0.211
>             HighTermTitleBDVSort        9.11      (1.5%)        8.96      (1.4%)   -1.7% (  -4% -    1%) 0.000
>        BrowseDayOfYearSSDVFacets        6.64     (15.7%)        6.54     (15.1%)   -1.5% ( -27% -   34%) 0.765
>            BrowseMonthSSDVFacets        6.50     (11.8%)        6.42     (11.6%)   -1.3% ( -22% -   25%) 0.729
>                         HighTerm      565.97      (6.5%)      562.79      (7.0%)   -0.6% ( -13% -   13%) 0.793
>                HighTermMonthSort     2350.90      (4.5%)     2338.02      (4.0%)   -0.5% (  -8% -    8%) 0.684
>                      AndHighHigh       55.14      (3.3%)       54.97      (4.1%)   -0.3% (  -7% -    7%) 0.803
>                     OrHighNotMed      406.21      (7.2%)      405.14      (6.6%)   -0.3% ( -13% -   14%) 0.904
>                    OrNotHighHigh      425.43      (3.5%)      424.78      (3.3%)   -0.2% (  -6% -    6%) 0.886
>             MedTermDayTaxoFacets       30.90      (2.0%)       30.86      (1.6%)   -0.1% (  -3% -    3%) 0.834
>                  MedSloppyPhrase       32.77      (3.7%)       32.73      (3.7%)   -0.1% (  -7% -    7%) 0.921
>                       AndHighMed      184.44      (3.4%)      184.37      (3.6%)   -0.0% (  -6% -    7%) 0.969
>                  LowSloppyPhrase       52.47      (1.7%)       52.47      (1.6%)    0.0% (  -3% -    3%) 0.996
>                    OrHighNotHigh      619.12      (5.0%)      619.46      (4.6%)    0.1% (  -9% -   10%) 0.971
>                     OrHighNotLow      567.45      (6.5%)      568.21      (6.2%)    0.1% ( -11% -   13%) 0.947
>                         PKLookup      275.72      (2.3%)      276.13      (3.3%)    0.1% (  -5% -    5%) 0.872
>              LowIntervalsOrdered        6.15      (2.1%)        6.16      (2.5%)    0.1% (  -4% -    4%) 0.836
>                           IntNRQ       76.94      (5.8%)       77.11      (4.5%)    0.2% (  -9% -   11%) 0.895
>                 HighSloppyPhrase        2.32      (2.7%)        2.33      (2.3%)    0.3% (  -4% -    5%) 0.685
>                          LowTerm      629.84      (3.5%)      632.74      (3.3%)    0.5% (  -6% -    7%) 0.670
>                      LowSpanNear       99.79      (2.7%)      100.30      (3.2%)    0.5% (  -5% -    6%) 0.589
>                          MedTerm      889.05      (4.2%)      893.75      (4.6%)    0.5% (  -7% -    9%) 0.703
>                     OrNotHighMed      361.55      (3.2%)      363.50      (3.1%)    0.5% (  -5% -    7%) 0.591
>                          Prefix3      134.42      (4.3%)      135.18      (3.8%)    0.6% (  -7% -    8%) 0.656
>                HighTermTitleSort      188.27      (2.1%)      189.35      (2.5%)    0.6% (  -3% -    5%) 0.423
>             HighIntervalsOrdered        7.97      (4.8%)        8.02      (6.0%)    0.6% (  -9% -   11%) 0.736
>                         Wildcard       57.19      (2.9%)       57.53      (3.1%)    0.6% (  -5% -    6%) 0.525
>                        OrHighLow      551.58      (2.9%)      555.28      (2.5%)    0.7% (  -4% -    6%) 0.436
>              MedIntervalsOrdered       29.22      (4.9%)       29.41      (6.1%)    0.7% (  -9% -   12%) 0.697
>                        MedPhrase       30.11      (2.1%)       30.32      (1.5%)    0.7% (  -2% -    4%) 0.241
>                       OrHighHigh       54.77      (6.7%)       55.15      (5.2%)    0.7% ( -10% -   13%) 0.714
>                           Fuzzy1      108.14      (2.8%)      108.90      (2.4%)    0.7% (  -4% -    6%) 0.403
>                        OrHighMed      182.51      (5.4%)      183.80      (3.4%)    0.7% (  -7% -   10%) 0.622
>          AndHighMedDayTaxoFacets       30.18      (3.1%)       30.40      (2.3%)    0.7% (  -4% -    6%) 0.403
>            HighTermDayOfYearSort      462.68      (3.7%)      466.03      (3.6%)    0.7% (  -6% -    8%) 0.532
>                       AndHighLow     1225.05      (5.2%)     1233.95      (4.5%)    0.7% (  -8% -   10%) 0.636
>                      MedSpanNear       13.85      (2.2%)       13.95      (2.0%)    0.7% (  -3% -    5%) 0.264
>                        LowPhrase      204.19      (2.6%)      205.88      (1.9%)    0.8% (  -3% -    5%) 0.247
>                       HighPhrase      105.85      (3.1%)      106.80      (2.6%)    0.9% (  -4% -    6%) 0.322
>                           Fuzzy2       22.92      (2.6%)       23.13      (2.1%)    0.9% (  -3% -    5%) 0.233
>                       TermDTSort      295.84      (7.3%)      298.66      (6.6%)    1.0% ( -12% -   16%) 0.665
>                          Respell       78.37      (2.3%)       79.15      (1.8%)    1.0% (  -2% -    5%) 0.125
>         AndHighHighDayTaxoFacets        2.70      (4.8%)        2.72      (2.5%)    1.0% (  -6% -    8%) 0.407
>                     OrNotHighLow     1134.11      (3.2%)     1146.96      (3.8%)    1.1% (  -5% -    8%) 0.310
>                     HighSpanNear        3.88      (7.1%)        3.95      (4.9%)    1.7% (  -9% -   14%) 0.376
>                            range     5910.33      (9.7%)     6049.55      (8.0%)    2.4% ( -14% -   22%) 0.403
>             BrowseDateSSDVFacets        1.19     (14.3%)        1.24     (19.0%)    4.1% ( -25% -   43%) 0.446
>             BrowseDateTaxoFacets        6.67      (4.6%)        7.08     (24.2%)    6.1% ( -21% -   36%) 0.264
>        BrowseDayOfYearTaxoFacets        6.74      (4.9%)        7.17     (23.8%)    6.4% ( -21% -   36%) 0.237
>      BrowseRandomLabelTaxoFacets        5.39      (3.7%)        6.02     (52.8%)   11.7% ( -43% -   70%) 0.322
>            BrowseMonthTaxoFacets        8.20     (35.8%)        9.48     (37.2%)   15.6% ( -42% -  138%) 0.177
>
>
>
> On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > That makes sense to me too in the abstract. At Amazon we also have
> > interesting BDV fields we have to decode on the fly, so this looks
> > attractive for that reason (not just faceting).
> >
> > I would say though that it would be easier to evaluate the fitness for
> > purpose (faceting) if we had some examples of BinaryDocValues used for
> > faceting (or otherwise being decoded on the fly) in the Lucene code
> > base -- do we have that?  I'd be concerned if we're not able to fully
> > test the new functionality to see what the impact of any changes might
> > be.
> >
> > On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
> > <christopher.hega...@elastic.co.invalid> wrote:
> > >
> > > Hi Ignacio,
> > >
> > > > I completely agree with the idea of having a BytesRef-like thing that
> > > > can be off-heap. For a while now I’ve been thinking about how we could
> > > > evolve BytesRef so as to not expose its on-heap representation. Having a
> > > > separate primitive is probably a better way to go.
> > >
> > > -Chris.
> > >
> > > > On 5 Dec 2024, at 10:42, Ignacio Vera <iver...@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > > I have been working with the idea of reading binary doc values
> > > > > off-heap for a while. The idea behind it is that binary doc values are
> > > > > often used for faceting where structured data is encoded at write time
> > > > > and decoded at read time. It feels wasteful to have to read the data
> > > > > on-heap before decoding it when we can read the data directly from the
> > > > > off-heap buffer.
> > > >
> > > > > The current proposal is to evolve the current API from an on-heap data
> > > > > structure (BytesRef) to an off-heap data structure (currently named
> > > > > RandomAccessInputRef). Because we are currently reading the data into
> > > > > the buffer using a RandomAccessInput with an offset and a length, it
> > > > > feels very natural to create an off-heap equivalent to BytesRef that
> > > > > is backed by a RandomAccessInput.
> > > >
> > > > > I am hoping to move this idea forward. Since this is a change to a
> > > > > public API, I am asking for feedback and would love to hear other
> > > > > opinions.
> > > >
> > > > Thank you!
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
