Feng Guo created LUCENE-10333:
---------------------------------
Summary: Speed up BinaryDocValues with a batch reading on
LongValues
Key: LUCENE-10333
URL: https://issues.apache.org/jira/browse/LUCENE-10333
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Reporter: Feng Guo
*Description*
In {{Lucene90DocValuesProducer}}, {{BinaryDocValues}} (as well as
{{SortedNumericDocValues}} in the non-singleton case) has code patterns like this:
{code:java}
long startOffset = addresses.get(doc);
bytes.length = (int) (addresses.get(doc + 1L) - startOffset);
{code}
This means we read 2 longs that are stored next to each other, using two
separate calls. We could probably push this information down to {{LongValues}}
and read both values in one call. I think this makes sense because this code
path can be rather hot.
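To illustrate the idea only (this is not the actual patch), a batch read could look roughly like the sketch below. {{SimpleLongValues}}, {{ArrayLongValues}}, {{getPair}} and {{valueLength}} are hypothetical names made up for this example; the real {{LongValues}} implementations decode packed data, which is where resolving the position once for both values would be expected to pay off.

```java
class BatchReadSketch {
  /** Simplified stand-in for Lucene's LongValues (hypothetical, not the real class). */
  abstract static class SimpleLongValues {
    /** Returns the value at the given index. */
    abstract long get(long index);

    /**
     * Hypothetical batch read: stores the values at index and index + 1
     * in dst[0] and dst[1]. The default falls back to two single reads.
     */
    void getPair(long index, long[] dst) {
      dst[0] = get(index);
      dst[1] = get(index + 1);
    }
  }

  /** Array-backed implementation that resolves the index once and reads both adjacent slots. */
  static final class ArrayLongValues extends SimpleLongValues {
    private final long[] values;

    ArrayLongValues(long[] values) {
      this.values = values;
    }

    @Override
    long get(long index) {
      return values[Math.toIntExact(index)];
    }

    @Override
    void getPair(long index, long[] dst) {
      int i = Math.toIntExact(index);
      dst[0] = values[i];
      dst[1] = values[i + 1];
    }
  }

  /** Length of the binary value of doc, computed from one batch read of its two offsets. */
  static int valueLength(SimpleLongValues addresses, int doc, long[] scratch) {
    addresses.getPair(doc, scratch);
    return (int) (scratch[1] - scratch[0]);
  }

  public static void main(String[] args) {
    // Offsets for three docs whose values have lengths 5, 0 and 7.
    SimpleLongValues addresses = new ArrayLongValues(new long[] {0, 5, 5, 12});
    long[] scratch = new long[2];
    for (int doc = 0; doc < 3; doc++) {
      System.out.println("doc " + doc + " length=" + valueLength(addresses, doc, scratch));
    }
  }
}
```

The caller reuses one {{scratch}} array so the batch call does not allocate per document; whether a two-element array or some other return shape is best for the real API is an open design question.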
*Benchmark*
In today's LuceneUtil benchmark, all results look even. I suspect this is
because we no longer use {{BinaryDocValues}} in any task. So I rolled both the
baseline and the candidate back to a stale code version (before
https://issues.apache.org/jira/browse/LUCENE-10062), where
{{BinaryDocValues}} was used to store taxonomy ordinals, and a QPS increase can
be seen there. (This is tricky; I wonder if we can have a more official way to
benchmark {{BinaryDocValues}} by changing some params or adding some tasks?)
Anyway, I believe it is still worth optimizing {{BinaryDocValues}} even though
facets do not use it any more :)
*Benchmark result on the stale code version where taxonomy ordinals are stored in
BinaryDocValues (to justify a speed-up in BinaryDocValues)*
{code:java}
Task                        QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                 p-value
BrowseMonthSSDVFacets              17.25   (8.6%)                    16.78  (17.8%)     -2.7% ( -26% -  25%)    0.536
LowTerm                          1458.66   (3.6%)                  1438.15   (4.4%)     -1.4% (  -9% -   6%)    0.268
HighTermDayOfYearSort             108.55  (10.0%)                   108.04   (9.1%)     -0.5% ( -17% -  20%)    0.874
HighPhrase                        168.65   (1.9%)                   168.06   (2.3%)     -0.3% (  -4% -   3%)    0.602
OrNotHighLow                     1201.79   (3.4%)                  1197.93   (4.6%)     -0.3% (  -8% -   7%)    0.801
HighSpanNear                       15.26   (1.6%)                    15.21   (1.4%)     -0.3% (  -3% -   2%)    0.499
Respell                            62.61   (1.8%)                    62.45   (1.9%)     -0.3% (  -3% -   3%)    0.649
MedPhrase                          57.57   (1.4%)                    57.44   (1.8%)     -0.2% (  -3% -   2%)    0.648
OrHighMed                         129.10   (3.0%)                   128.83   (3.1%)     -0.2% (  -6% -   6%)    0.830
MedSpanNear                        19.45   (2.3%)                    19.41   (2.2%)     -0.2% (  -4% -   4%)    0.784
OrHighHigh                         34.85   (1.5%)                    34.79   (1.4%)     -0.2% (  -3% -   2%)    0.722
HighIntervalsOrdered               26.92   (4.7%)                    26.89   (4.9%)     -0.1% (  -9% -   9%)    0.929
IntNRQ                            343.52   (1.6%)                   343.16   (2.0%)     -0.1% (  -3% -   3%)    0.855
OrHighNotHigh                     595.61   (3.2%)                   595.10   (4.3%)     -0.1% (  -7% -   7%)    0.944
MedIntervalsOrdered                17.66   (3.6%)                    17.65   (3.8%)     -0.1% (  -7% -   7%)    0.961
LowIntervalsOrdered               109.23   (3.3%)                   109.18   (3.5%)     -0.0% (  -6% -   7%)    0.969
AndHighHigh                        81.09   (1.5%)                    81.10   (2.0%)      0.0% (  -3% -   3%)    0.967
LowSpanNear                       203.33   (2.1%)                   203.41   (1.8%)      0.0% (  -3% -   3%)    0.948
MedSloppyPhrase                    27.15   (1.5%)                    27.17   (1.2%)      0.1% (  -2% -   2%)    0.907
LowPhrase                          75.76   (1.8%)                    75.81   (2.0%)      0.1% (  -3% -   3%)    0.904
AndHighMedDayTaxoFacets            97.27   (1.9%)                    97.35   (1.9%)      0.1% (  -3% -   4%)    0.888
HighSloppyPhrase                   14.32   (2.7%)                    14.34   (1.8%)      0.1% (  -4% -   4%)    0.870
Fuzzy2                             76.00   (3.9%)                    76.12   (3.4%)      0.2% (  -6% -   7%)    0.894
Wildcard                          123.51   (1.8%)                   123.71   (2.1%)      0.2% (  -3% -   4%)    0.796
OrHighNotLow                      722.64   (4.4%)                   724.15   (5.4%)      0.2% (  -9% -  10%)    0.894
AndHighLow                        929.73   (4.0%)                   931.75   (3.8%)      0.2% (  -7% -   8%)    0.859
Prefix3                           240.13   (1.5%)                   240.69   (1.9%)      0.2% (  -3% -   3%)    0.675
AndHighMed                        210.17   (1.7%)                   210.84   (1.6%)      0.3% (  -2% -   3%)    0.532
LowSloppyPhrase                   142.83   (1.8%)                   143.54   (2.0%)      0.5% (  -3% -   4%)    0.410
OrNotHighMed                      709.24   (4.4%)                   712.78   (4.3%)      0.5% (  -7% -   9%)    0.715
Fuzzy1                             85.33   (5.7%)                    85.77   (6.3%)      0.5% ( -10% -  13%)    0.786
MedTerm                          1466.50   (3.5%)                  1474.85   (3.9%)      0.6% (  -6% -   8%)    0.629
TermDTSort                        105.51   (7.7%)                   106.33   (7.3%)      0.8% ( -13% -  17%)    0.746
PKLookup                          206.18   (2.9%)                   208.68   (2.9%)      1.2% (  -4% -   7%)    0.179
OrHighNotMed                      876.71   (3.0%)                   887.84   (3.9%)      1.3% (  -5% -   8%)    0.251
OrNotHighHigh                     774.25   (4.7%)                   785.03   (6.0%)      1.4% (  -8% -  12%)    0.411
HighTermMonthSort                  74.33   (9.4%)                    75.47  (16.3%)      1.5% ( -22% -  30%)    0.716
OrHighLow                         518.73   (5.2%)                   528.27   (5.4%)      1.8% (  -8% -  13%)    0.272
HighTerm                         1892.16   (3.4%)                  1934.63   (5.5%)      2.2% (  -6% -  11%)    0.120
AndHighHighDayTaxoFacets           16.46   (2.7%)                    16.84   (2.3%)      2.3% (  -2% -   7%)    0.004
HighTermTitleBDVSort              141.39  (14.6%)                   145.33  (15.1%)      2.8% ( -23% -  38%)    0.554
MedTermDayTaxoFacets               27.81   (2.1%)                    29.54   (2.3%)      6.2% (   1% -  10%)    0.000
OrHighMedDayTaxoFacets              3.05   (1.9%)                     3.30   (2.2%)      8.3% (   4% -  12%)    0.000
BrowseDayOfYearSSDVFacets          17.36  (13.0%)                    18.97  (15.8%)      9.3% ( -17% -  43%)    0.042
BrowseDayOfYearTaxoFacets           3.02   (3.6%)                     3.79   (2.5%)     25.4% (  18% -  32%)    0.000
BrowseDateTaxoFacets                3.01   (3.6%)                     3.79   (2.5%)     25.6% (  18% -  32%)    0.000
BrowseMonthTaxoFacets               3.14   (2.1%)                     3.99   (2.5%)     27.0% (  21% -  32%)    0.000
{code}
*Benchmark result on the newest code version*
{code:java}
Task                        QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                 p-value
TermDTSort                        129.74  (10.9%)                   127.83  (11.3%)     -1.5% ( -21% -  23%)    0.675
HighTerm                         1182.13   (5.1%)                  1172.76   (6.5%)     -0.8% ( -11% -  11%)    0.668
HighSpanNear                        7.99   (4.2%)                     7.96   (4.2%)     -0.3% (  -8% -   8%)    0.816
HighIntervalsOrdered               17.86   (2.1%)                    17.85   (2.3%)     -0.1% (  -4% -   4%)    0.927
BrowseDateTaxoFacets               19.61  (17.2%)                    19.61  (17.4%)     -0.0% ( -29% -  41%)    0.995
OrNotHighHigh                     619.85   (4.3%)                   619.72   (8.6%)     -0.0% ( -12% -  13%)    0.992
PKLookup                          202.14   (5.6%)                   202.11   (4.4%)     -0.0% (  -9% -  10%)    0.994
LowIntervalsOrdered                25.53   (1.5%)                    25.53   (1.6%)      0.0% (  -3% -   3%)    1.000
BrowseDayOfYearSSDVFacets          14.27   (2.7%)                    14.28   (2.7%)      0.0% (  -5% -   5%)    0.965
MedIntervalsOrdered                47.33   (1.9%)                    47.34   (2.0%)      0.0% (  -3% -   3%)    0.947
BrowseRandomLabelSSDVFacets        10.25   (2.4%)                    10.26   (2.4%)      0.1% (  -4% -   4%)    0.935
BrowseMonthSSDVFacets              15.66   (3.0%)                    15.67   (3.0%)      0.1% (  -5% -   6%)    0.945
MedSloppyPhrase                    11.97   (1.7%)                    11.98   (1.9%)      0.1% (  -3% -   3%)    0.840
Wildcard                           25.71   (2.6%)                    25.75   (2.4%)      0.1% (  -4% -   5%)    0.875
MedPhrase                          33.62   (2.5%)                    33.68   (2.6%)      0.2% (  -4% -   5%)    0.802
HighTermDayOfYearSort              80.58  (11.0%)                    80.76  (10.6%)      0.2% ( -19% -  24%)    0.949
HighTermTitleBDVSort              130.43  (11.7%)                   130.73  (10.7%)      0.2% ( -19% -  25%)    0.947
AndHighHighDayTaxoFacets           32.25   (3.0%)                    32.33   (2.9%)      0.2% (  -5% -   6%)    0.796
LowSloppyPhrase                    39.50   (1.7%)                    39.61   (1.4%)      0.3% (  -2% -   3%)    0.586
Prefix3                           127.42   (3.8%)                   127.77   (3.4%)      0.3% (  -6% -   7%)    0.812
HighTermMonthSort                 117.65   (8.4%)                   117.98   (8.1%)      0.3% ( -14% -  18%)    0.915
HighSloppyPhrase                   14.47   (1.8%)                    14.51   (2.2%)      0.3% (  -3% -   4%)    0.647
MedSpanNear                        48.78   (2.2%)                    48.93   (2.0%)      0.3% (  -3% -   4%)    0.640
OrHighMedDayTaxoFacets             13.42   (3.7%)                    13.48   (3.6%)      0.4% (  -6% -   7%)    0.730
AndHighMedDayTaxoFacets            37.90   (3.0%)                    38.05   (3.4%)      0.4% (  -5% -   7%)    0.694
Fuzzy1                             83.31   (3.9%)                    83.70   (4.9%)      0.5% (  -7% -   9%)    0.738
Respell                            49.74   (1.3%)                    50.00   (1.5%)      0.5% (  -2% -   3%)    0.254
OrHighLow                         531.57   (8.0%)                   534.83   (6.7%)      0.6% ( -13% -  16%)    0.792
AndHighHigh                        71.99   (2.6%)                    72.44   (3.4%)      0.6% (  -5% -   6%)    0.520
LowSpanNear                       191.64   (3.5%)                   192.85   (3.7%)      0.6% (  -6% -   8%)    0.580
MedTermDayTaxoFacets               55.51   (3.1%)                    55.86   (3.9%)      0.6% (  -6% -   7%)    0.567
BrowseRandomLabelTaxoFacets     11492.93   (5.0%)                 11570.83   (4.8%)      0.7% (  -8% -  11%)    0.663
IntNRQ                             93.40   (2.1%)                    94.05   (2.4%)      0.7% (  -3% -   5%)    0.319
AndHighMed                        175.02   (2.6%)                   176.42   (3.9%)      0.8% (  -5% -   7%)    0.445
Fuzzy2                             45.25   (7.2%)                    45.64   (6.2%)      0.9% ( -11% -  15%)    0.682
AndHighLow                        825.32   (6.8%)                   833.43   (8.0%)      1.0% ( -12% -  16%)    0.677
MedTerm                          1408.91   (6.2%)                  1423.27  (10.2%)      1.0% ( -14% -  18%)    0.703
OrHighMed                         136.68   (3.8%)                   138.15   (3.6%)      1.1% (  -6% -   8%)    0.356
OrHighHigh                         16.31   (3.4%)                    16.49   (1.9%)      1.1% (  -4% -   6%)    0.205
BrowseDayOfYearTaxoFacets       11349.30   (4.4%)                 11494.17   (4.6%)      1.3% (  -7% -  10%)    0.366
HighPhrase                         83.13   (2.9%)                    84.24   (3.4%)      1.3% (  -4% -   7%)    0.184
OrHighNotMed                      630.30   (5.6%)                   639.65   (6.4%)      1.5% (  -9% -  14%)    0.436
LowPhrase                         310.17   (4.2%)                   315.08   (5.4%)      1.6% (  -7% -  11%)    0.297
OrHighNotHigh                     723.22   (5.0%)                   734.71   (8.4%)      1.6% ( -11% -  15%)    0.468
BrowseMonthTaxoFacets           11665.05   (7.6%)                 11892.66   (5.1%)      2.0% (  -9% -  15%)    0.339
OrHighNotLow                      851.60   (6.5%)                   869.16   (7.6%)      2.1% ( -11% -  17%)    0.355
OrNotHighMed                      699.29   (5.2%)                   717.74   (7.7%)      2.6% (  -9% -  16%)    0.205
OrNotHighLow                      954.65   (6.4%)                   982.93   (9.6%)      3.0% ( -12% -  20%)    0.252
LowTerm                          2158.23   (9.1%)                  2227.33  (13.4%)      3.2% ( -17% -  28%)    0.377
{code}