Re: Off-heap binary doc values

Ignacio Vera Mon, 09 Dec 2024 01:11:42 -0800

I have rerun the luceneutil benchmark with more iterations and a higher
repeat count, but the results are still very noisy, which I attribute to
running the benchmarks on a laptop.


The results always show some of the facets tasks speeding up while
others slow down slightly. One task that clearly shows a slowdown is
HighTermTitleBDVSort, which I expect because we now read those bytes
on heap using a BytesRefBuilder. The only ways to prevent this slowdown
would be to make the off-heap BytesRef thingy implement the Comparable
interface efficiently or, alternatively, to expose the doc values max
length so implementations can do the same as we are doing today.
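
A minimal sketch of the first option, to make the idea concrete. It
assumes a direct ByteBuffer as a stand-in for the off-heap data; the
class OffHeapSlice and its helper of() are hypothetical names used only
for illustration and are neither the PR's RandomAccessInputRef nor any
Lucene API. It compares two values in unsigned byte order (the order
BytesRef uses) without first copying them on heap:

```java
import java.nio.ByteBuffer;
import java.util.Objects;

// Hypothetical illustration only: not the PR's RandomAccessInputRef and not a Lucene API.
// A direct ByteBuffer stands in for the off-heap doc values data.
final class OffHeapSlice implements Comparable<OffHeapSlice> {
    private final ByteBuffer buffer; // off-heap storage, read in place
    private final int offset;        // start of this value inside the buffer
    private final int length;        // number of bytes in this value

    OffHeapSlice(ByteBuffer buffer, int offset, int length) {
        Objects.checkFromIndexSize(offset, length, buffer.capacity());
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    // Demo helper: copies the given bytes into a fresh direct buffer.
    static OffHeapSlice of(byte... bytes) {
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        return new OffHeapSlice(direct, 0, bytes.length);
    }

    // Unsigned lexicographic order, compared byte by byte with absolute reads,
    // so neither value is copied into an on-heap array first.
    @Override
    public int compareTo(OffHeapSlice other) {
        int common = Math.min(length, other.length);
        for (int i = 0; i < common; i++) {
            int a = buffer.get(offset + i) & 0xFF;
            int b = other.buffer.get(other.offset + i) & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return length - other.length; // a strict prefix sorts first
    }
}
```

Whether a per-byte loop like this is fast enough for a sort such as
HighTermTitleBDVSort is exactly what would need benchmarking; the sketch
only shows that the ordering can be computed without the on-heap copy.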

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                  Pct diff  p-value
            HighTermTitleBDVSort          4.30      (2.8%)                     4.19      (2.0%)     -2.5% (  -7% -    2%)    0.000
           BrowseMonthTaxoFacets         10.25     (26.1%)                    10.06     (26.4%)     -1.8% ( -43% -   68%)    0.623
                      TermDTSort        216.72      (7.3%)                   213.19      (6.7%)     -1.6% ( -14% -   13%)    0.100
                    OrNotHighMed        464.01      (3.5%)                   459.77      (3.5%)     -0.9% (  -7% -    6%)    0.066
                        Wildcard        305.50      (4.1%)                   303.08      (3.9%)     -0.8% (  -8% -    7%)    0.164
                         Respell         60.92      (2.5%)                    60.51      (2.4%)     -0.7% (  -5% -    4%)    0.055
                      AndHighLow       1406.33      (3.1%)                  1397.09      (3.3%)     -0.7% (  -6% -    5%)    0.151
                   OrHighNotHigh        435.53      (4.8%)                   432.71      (5.6%)     -0.6% ( -10% -   10%)    0.383
                         LowTerm        931.83      (3.5%)                   925.97      (3.2%)     -0.6% (  -7% -    6%)    0.185
           HighTermDayOfYearSort        410.08      (4.9%)                   407.77      (4.9%)     -0.6% (  -9% -    9%)    0.416
        AndHighHighDayTaxoFacets          6.45      (2.1%)                     6.42      (2.4%)     -0.6% (  -4% -    3%)    0.081
                       MedPhrase        117.99      (2.7%)                   117.34      (2.7%)     -0.6% (  -5% -    4%)    0.142
                       LowPhrase         24.97      (2.9%)                    24.84      (3.1%)     -0.5% (  -6% -    5%)    0.210
         AndHighMedDayTaxoFacets         18.40      (2.0%)                    18.31      (2.3%)     -0.5% (  -4% -    3%)    0.095
                        HighTerm        793.15      (4.4%)                   789.25      (5.2%)     -0.5% (  -9% -    9%)    0.469
          OrHighMedDayTaxoFacets          4.64      (3.3%)                     4.62      (3.7%)     -0.5% (  -7% -    6%)    0.334
                       OrHighMed        188.68      (2.6%)                   187.79      (3.1%)     -0.5% (  -6% -    5%)    0.244
                    OrNotHighLow       1277.67      (2.5%)                  1271.99      (2.9%)     -0.4% (  -5% -    5%)    0.245
                      HighPhrase         59.28      (3.6%)                    59.03      (3.4%)     -0.4% (  -7% -    6%)    0.402
                          Fuzzy2         82.34      (2.2%)                    82.02      (2.2%)     -0.4% (  -4% -    4%)    0.205
                   OrNotHighHigh        688.74      (3.8%)                   686.08      (4.4%)     -0.4% (  -8% -    8%)    0.507
            MedTermDayTaxoFacets         10.69      (3.2%)                    10.65      (2.9%)     -0.4% (  -6% -    5%)    0.382
                    OrHighNotMed        722.62      (4.8%)                   720.03      (5.7%)     -0.4% ( -10% -   10%)    0.630
                         MedTerm        979.84      (3.5%)                   976.55      (4.0%)     -0.3% (  -7% -    7%)    0.528
                       OrHighLow        780.09      (2.7%)                   777.60      (2.8%)     -0.3% (  -5% -    5%)    0.413
                    OrHighNotLow        639.07      (5.1%)                   637.34      (6.0%)     -0.3% ( -10% -   11%)    0.728
                         Prefix3        371.35      (2.3%)                   370.43      (1.9%)     -0.2% (  -4% -    4%)    0.409
                      AndHighMed        127.87      (2.4%)                   127.63      (2.8%)     -0.2% (  -5% -    5%)    0.613
            HighIntervalsOrdered          4.88      (5.7%)                     4.88      (5.8%)     -0.1% ( -11% -   12%)    0.882
               HighTermMonthSort       2844.77      (2.7%)                  2841.57      (2.8%)     -0.1% (  -5% -    5%)    0.770
                      OrHighHigh         56.75      (1.8%)                    56.70      (2.1%)     -0.1% (  -3% -    3%)    0.793
             MedIntervalsOrdered          7.42      (3.7%)                     7.41      (3.8%)     -0.1% (  -7% -    7%)    0.894
             LowIntervalsOrdered         22.00      (3.6%)                    21.99      (3.6%)     -0.0% (  -7% -    7%)    0.930
                 MedSloppyPhrase         34.10      (2.3%)                    34.11      (2.3%)      0.0% (  -4% -    4%)    0.931
                           range       7621.67      (4.4%)                  7625.28      (3.8%)      0.0% (  -7% -    8%)    0.935
               HighTermTitleSort        208.38      (2.7%)                   208.52      (2.6%)      0.1% (  -5% -    5%)    0.858
                     LowSpanNear          9.95      (1.4%)                     9.96      (1.5%)      0.1% (  -2% -    3%)    0.668
                          Fuzzy1         70.10      (2.4%)                    70.23      (2.6%)      0.2% (  -4% -    5%)    0.616
                        PKLookup        262.66      (3.9%)                   263.16      (4.3%)      0.2% (  -7% -    8%)    0.743
                HighSloppyPhrase          7.09      (3.7%)                     7.11      (4.2%)      0.2% (  -7% -    8%)    0.732
                     AndHighHigh         49.37      (2.2%)                    49.47      (2.5%)      0.2% (  -4% -    4%)    0.538
                 LowSloppyPhrase        186.55      (6.7%)                   186.96      (6.7%)      0.2% ( -12% -   14%)    0.813
                     MedSpanNear         33.62      (2.6%)                    33.71      (2.8%)      0.2% (  -5% -    5%)    0.523
                          IntNRQ         25.87      (5.7%)                    25.94      (5.0%)      0.3% (  -9% -   11%)    0.713
                    HighSpanNear         11.58      (3.1%)                    11.61      (3.2%)      0.3% (  -5% -    6%)    0.528
       BrowseDayOfYearTaxoFacets          6.98      (8.0%)                     7.08     (12.6%)      1.4% ( -17% -   23%)    0.333
            BrowseDateTaxoFacets          6.90      (8.2%)                     7.01     (12.8%)      1.5% ( -17% -   24%)    0.321
     BrowseRandomLabelSSDVFacets          5.04      (8.6%)                     5.15     (11.5%)      2.1% ( -16% -   24%)    0.148
            BrowseDateSSDVFacets          1.37     (15.9%)                     1.40     (16.0%)      2.1% ( -25% -   40%)    0.341
       BrowseDayOfYearSSDVFacets          6.24     (10.4%)                     6.39     (12.9%)      2.4% ( -18% -   28%)    0.148
     BrowseRandomLabelTaxoFacets          5.65      (8.3%)                     5.80     (22.2%)      2.5% ( -25% -   35%)    0.286
           BrowseMonthSSDVFacets          6.24     (10.7%)                     6.40     (13.4%)      2.6% ( -19% -   29%)    0.136

On Sat, Dec 7, 2024 at 9:08 PM Adrien Grand <jpou...@gmail.com> wrote:
>
> FWIW I have also seen some users store sparse vectors or bloom filters in
> binary doc values. In both cases, the serialized size may be non-negligible
> while not all bytes are needed. This change would likely help.
>
> It would be good if the binary sort and faceting tasks did not show a big
> slowdown: these should be the worst-case scenario for this change, since all
> bytes need to be read?
>
> @Ignacio Your luceneutil results show a couple of significant speedups and
> small slowdowns, but the p-values are high, which suggests that the results
> are very noisy. I wonder if the benchmark had enough iterations or a high
> enough taskRepeatCount.
>
> On Thu, Dec 5, 2024 at 5:38 PM Ignacio Vera <iver...@gmail.com> wrote:
>>
>> @Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.
>>
>> @Mike: Yes, there are other use cases. One that is close to my heart
>> is the geo use case, where in many cases you don't need to read all the
>> bytes and geometries can be big. In Lucene there are some interesting
>> usages in the facets module, which I have already implemented in the PR.
>> Running the wikimedium benchmark on it shows (I think) an improvement
>> on the facets runs as well as some regressions:
>>
>>      BrowseRandomLabelSSDVFacets          5.21     (12.5%)                     5.01      (8.8%)     -3.8% ( -22% -   20%)    0.264
>>           OrHighMedDayTaxoFacets          4.78      (5.5%)                     4.68      (5.0%)     -2.1% ( -11% -    8%)    0.211
>>             HighTermTitleBDVSort          9.11      (1.5%)                     8.96      (1.4%)     -1.7% (  -4% -    1%)    0.000
>>        BrowseDayOfYearSSDVFacets          6.64     (15.7%)                     6.54     (15.1%)     -1.5% ( -27% -   34%)    0.765
>>            BrowseMonthSSDVFacets          6.50     (11.8%)                     6.42     (11.6%)     -1.3% ( -22% -   25%)    0.729
>>                         HighTerm        565.97      (6.5%)                   562.79      (7.0%)     -0.6% ( -13% -   13%)    0.793
>>                HighTermMonthSort       2350.90      (4.5%)                  2338.02      (4.0%)     -0.5% (  -8% -    8%)    0.684
>>                      AndHighHigh         55.14      (3.3%)                    54.97      (4.1%)     -0.3% (  -7% -    7%)    0.803
>>                     OrHighNotMed        406.21      (7.2%)                   405.14      (6.6%)     -0.3% ( -13% -   14%)    0.904
>>                    OrNotHighHigh        425.43      (3.5%)                   424.78      (3.3%)     -0.2% (  -6% -    6%)    0.886
>>             MedTermDayTaxoFacets         30.90      (2.0%)                    30.86      (1.6%)     -0.1% (  -3% -    3%)    0.834
>>                  MedSloppyPhrase         32.77      (3.7%)                    32.73      (3.7%)     -0.1% (  -7% -    7%)    0.921
>>                       AndHighMed        184.44      (3.4%)                   184.37      (3.6%)     -0.0% (  -6% -    7%)    0.969
>>                  LowSloppyPhrase         52.47      (1.7%)                    52.47      (1.6%)      0.0% (  -3% -    3%)    0.996
>>                    OrHighNotHigh        619.12      (5.0%)                   619.46      (4.6%)      0.1% (  -9% -   10%)    0.971
>>                     OrHighNotLow        567.45      (6.5%)                   568.21      (6.2%)      0.1% ( -11% -   13%)    0.947
>>                         PKLookup        275.72      (2.3%)                   276.13      (3.3%)      0.1% (  -5% -    5%)    0.872
>>              LowIntervalsOrdered          6.15      (2.1%)                     6.16      (2.5%)      0.1% (  -4% -    4%)    0.836
>>                           IntNRQ         76.94      (5.8%)                    77.11      (4.5%)      0.2% (  -9% -   11%)    0.895
>>                 HighSloppyPhrase          2.32      (2.7%)                     2.33      (2.3%)      0.3% (  -4% -    5%)    0.685
>>                          LowTerm        629.84      (3.5%)                   632.74      (3.3%)      0.5% (  -6% -    7%)    0.670
>>                      LowSpanNear         99.79      (2.7%)                   100.30      (3.2%)      0.5% (  -5% -    6%)    0.589
>>                          MedTerm        889.05      (4.2%)                   893.75      (4.6%)      0.5% (  -7% -    9%)    0.703
>>                     OrNotHighMed        361.55      (3.2%)                   363.50      (3.1%)      0.5% (  -5% -    7%)    0.591
>>                          Prefix3        134.42      (4.3%)                   135.18      (3.8%)      0.6% (  -7% -    8%)    0.656
>>                HighTermTitleSort        188.27      (2.1%)                   189.35      (2.5%)      0.6% (  -3% -    5%)    0.423
>>             HighIntervalsOrdered          7.97      (4.8%)                     8.02      (6.0%)      0.6% (  -9% -   11%)    0.736
>>                         Wildcard         57.19      (2.9%)                    57.53      (3.1%)      0.6% (  -5% -    6%)    0.525
>>                        OrHighLow        551.58      (2.9%)                   555.28      (2.5%)      0.7% (  -4% -    6%)    0.436
>>              MedIntervalsOrdered         29.22      (4.9%)                    29.41      (6.1%)      0.7% (  -9% -   12%)    0.697
>>                        MedPhrase         30.11      (2.1%)                    30.32      (1.5%)      0.7% (  -2% -    4%)    0.241
>>                       OrHighHigh         54.77      (6.7%)                    55.15      (5.2%)      0.7% ( -10% -   13%)    0.714
>>                           Fuzzy1        108.14      (2.8%)                   108.90      (2.4%)      0.7% (  -4% -    6%)    0.403
>>                        OrHighMed        182.51      (5.4%)                   183.80      (3.4%)      0.7% (  -7% -   10%)    0.622
>>          AndHighMedDayTaxoFacets         30.18      (3.1%)                    30.40      (2.3%)      0.7% (  -4% -    6%)    0.403
>>            HighTermDayOfYearSort        462.68      (3.7%)                   466.03      (3.6%)      0.7% (  -6% -    8%)    0.532
>>                       AndHighLow       1225.05      (5.2%)                  1233.95      (4.5%)      0.7% (  -8% -   10%)    0.636
>>                      MedSpanNear         13.85      (2.2%)                    13.95      (2.0%)      0.7% (  -3% -    5%)    0.264
>>                        LowPhrase        204.19      (2.6%)                   205.88      (1.9%)      0.8% (  -3% -    5%)    0.247
>>                       HighPhrase        105.85      (3.1%)                   106.80      (2.6%)      0.9% (  -4% -    6%)    0.322
>>                           Fuzzy2         22.92      (2.6%)                    23.13      (2.1%)      0.9% (  -3% -    5%)    0.233
>>                       TermDTSort        295.84      (7.3%)                   298.66      (6.6%)      1.0% ( -12% -   16%)    0.665
>>                          Respell         78.37      (2.3%)                    79.15      (1.8%)      1.0% (  -2% -    5%)    0.125
>>         AndHighHighDayTaxoFacets          2.70      (4.8%)                     2.72      (2.5%)      1.0% (  -6% -    8%)    0.407
>>                     OrNotHighLow       1134.11      (3.2%)                  1146.96      (3.8%)      1.1% (  -5% -    8%)    0.310
>>                     HighSpanNear          3.88      (7.1%)                     3.95      (4.9%)      1.7% (  -9% -   14%)    0.376
>>                            range       5910.33      (9.7%)                  6049.55      (8.0%)      2.4% ( -14% -   22%)    0.403
>>             BrowseDateSSDVFacets          1.19     (14.3%)                     1.24     (19.0%)      4.1% ( -25% -   43%)    0.446
>>             BrowseDateTaxoFacets          6.67      (4.6%)                     7.08     (24.2%)      6.1% ( -21% -   36%)    0.264
>>        BrowseDayOfYearTaxoFacets          6.74      (4.9%)                     7.17     (23.8%)      6.4% ( -21% -   36%)    0.237
>>      BrowseRandomLabelTaxoFacets          5.39      (3.7%)                     6.02     (52.8%)     11.7% ( -43% -   70%)    0.322
>>            BrowseMonthTaxoFacets          8.20     (35.8%)                     9.48     (37.2%)     15.6% ( -42% -  138%)    0.177
>>
>>
>>
>> On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >
>> > That makes sense to me too in the abstract. At Amazon we also have
>> > interesting BDV fields we have to decode on the fly, so this looks
>> > attractive for that reason (not just faceting).
>> >
>> > I would say though that it would be easier to evaluate the fitness for
>> > purpose (faceting) if we had some examples of BinaryDocValues used for
>> > faceting (or otherwise being decoded on the fly) in the Lucene code
>> > base -- do we have that?  I'd be concerned if we're not able to fully
>> > test the new functionality to see what the impact of any changes might
>> > be.
>> >
>> > On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
>> > <christopher.hega...@elastic.co.invalid> wrote:
>> > >
>> > > Hi Ignacio,
>> > >
>> > > I completely agree with the idea of having a BytesRef-like thing that 
>> > > can be off-heap. For a while now I’ve been thinking about how we could 
>> > > evolve BytesRef so as to not expose its on-heap representation. Having a 
>> > > separate primitive is probably a better way to go.
>> > >
>> > > -Chris.
>> > >
>> > > > On 5 Dec 2024, at 10:42, Ignacio Vera <iver...@gmail.com> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I have been working with the idea of reading binary doc values
>> > > > off-heap for a while. The idea behind it is that binary doc values are
>> > > > often used for faceting where structured data is encoded at write time
>> > > > and decoded at read time. It feels wasteful to have to read the data
>> > > > on-heap before decoding it when we can read the data directly from the
>> > > > off-heap buffer.
>> > > >
>> > > > The current proposal is to evolve the current API from an on-heap data
>> > > > structure (BytesRef) to an off-heap data structure (currently named
>> > > > RandomAccessInputRef). Because we are currently reading the data into
>> > > > the buffer using a RandomAccessInput with an offset and a length, it
>> > > > feels very natural to create an off-heap equivalent to BytesRef that
>> > > > is backed by a RandomAccessInput.
>> > > >
>> > > > I am hoping to move this idea forward. As this is a change to a
>> > > > public API, I would love to hear feedback and other opinions.
>> > > >
>> > > > Thank you!
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
>> > > >

