[
https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309021#comment-17309021
]
Greg Miller commented on LUCENE-9877:
-------------------------------------
Here are my results from comparing 7 allowable exceptions in PFOR to the
baseline 3:
Similar to my approach in
[LUCENE-9850|https://issues.apache.org/jira/browse/LUCENE-9850], I threw
together a tool that visualizes the bits-per-value used across all PFOR encoded
blocks. I also added in a visualization of the exception count used across all
PFOR blocks. Here's what I found by building the Wikipedia EN index with the
"wikimedium10m" source in luceneutil (all zero-valued histogram entries were
truncated for readability):
{code:java}
BASELINE (3 exceptions)
PFOR BPV Histogram
0 *** [4.02 %] (239453 of
5959788)
1 ** [2.78 %] (165409 of
5959788)
2 ************************** [51.98 %] (3097737 of
5959788)
3 ***************** [32.61 %] (1943685 of
5959788)
4 **** [7.06 %] (420552 of
5959788)
5 * [1.49 %] (88700 of
5959788)
6 * [0.07 %] (4097 of
5959788)
7 * [0.00 %] (153 of 5959788)
8 * [0.00 %] (2 of 5959788)
Total bytes used: 234757569
PFOR Exceptions Used Histogram
0 ************************** [51.83 %] (3088921 of
5959788)
1 ************ [22.78 %] (1357732 of
5959788)
2 ******** [14.63 %] (872061 of
5959788)
3 ****** [10.76 %] (641074 of
5959788)
CANDIDATE (7 exceptions)
PFOR BPV Histogram
0 *** [4.02 %] (239583 of
5959788)
1 ***** [8.75 %] (521487 of
5959788)
2 ******************************** [62.40 %] (3718819 of
5959788)
3 *********** [20.26 %] (1207623 of
5959788)
4 ** [3.71 %] (221391 of
5959788)
5 * [0.82 %] (48939 of
5959788)
6 * [0.03 %] (1868 of
5959788)
7 * [0.00 %] (76 of 5959788)
8 * [0.00 %] (2 of 5959788)
Total bytes used: 216891765
PFOR Exceptions Used Histogram
0 **************** [30.93 %] (1843145 of
5959788)
1 ********** [18.46 %] (1100315 of
5959788)
2 ******* [13.38 %] (797217 of
5959788)
3 ****** [10.34 %] (615991 of
5959788)
4 ***** [8.36 %] (498042 of
5959788)
5 **** [7.03 %] (419005 of
5959788)
6 **** [6.11 %] (364388 of
5959788)
7 *** [5.40 %] (321685 of
5959788)
{code}
So we can observe the additional exceptions getting utilized and the
bits-per-value generally shifting lower. This seems to amount to a ~7.6%
decrease in space used for all of the PFOR blocks in aggregate.
Looking at the overall index size though, I only see a ~0.5% decrease in the
index size. :
{code:java}
3816536
wikimedium10m.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
3795532
wikimedium10m.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
{code}
I suspect the PFOR data in this index (frequencies, positions, offsets and
payloads) just aren't a big enough portion of the overall index to move the
needle much in overall size. If my tool is accurate, it looks like only ~0.2 GB
of the ~3.8 GB index is PFOR data.
As for performance implications, running a luceneutil benchmark (using "-source
wikimedium10m") shows the following:
{code:java}
TaskQPS baseline StdDevQPS pfor 7 exceptions
StdDev Pct diff p-value
HighTermDayOfYearSort 139.37 (10.8%) 135.30 (10.2%)
-2.9% ( -21% - 20%) 0.379
HighTermMonthSort 134.56 (13.4%) 130.65 (11.9%)
-2.9% ( -24% - 25%) 0.470
TermDTSort 100.66 (12.9%) 97.96 (13.5%)
-2.7% ( -25% - 27%) 0.521
HighTermTitleBDVSort 130.20 (9.6%) 127.06 (8.4%)
-2.4% ( -18% - 17%) 0.398
MedSloppyPhrase 45.83 (3.7%) 44.82 (3.8%)
-2.2% ( -9% - 5%) 0.063
OrHighNotHigh 505.30 (3.7%) 498.15 (4.6%)
-1.4% ( -9% - 7%) 0.288
OrHighLow 413.80 (5.0%) 408.04 (4.9%)
-1.4% ( -10% - 8%) 0.371
AndHighLow 575.78 (3.5%) 568.21 (3.6%)
-1.3% ( -8% - 5%) 0.241
MedPhrase 19.20 (2.5%) 18.98 (2.4%)
-1.1% ( -5% - 3%) 0.138
OrHighNotLow 504.91 (4.7%) 499.19 (4.2%)
-1.1% ( -9% - 8%) 0.423
LowSloppyPhrase 44.42 (4.0%) 44.01 (3.1%)
-0.9% ( -7% - 6%) 0.413
OrNotHighLow 789.01 (3.7%) 781.98 (4.4%)
-0.9% ( -8% - 7%) 0.488
LowSpanNear 100.30 (2.1%) 99.42 (2.0%)
-0.9% ( -4% - 3%) 0.178
HighIntervalsOrdered 13.86 (2.0%) 13.75 (1.9%)
-0.8% ( -4% - 3%) 0.174
Prefix3 230.15 (2.8%) 228.24 (3.3%)
-0.8% ( -6% - 5%) 0.394
HighSloppyPhrase 1.77 (3.8%) 1.76 (3.4%)
-0.8% ( -7% - 6%) 0.477
BrowseMonthTaxoFacets 5.01 (2.2%) 4.97 (3.5%)
-0.8% ( -6% - 4%) 0.396
AndHighHigh 35.38 (2.8%) 35.12 (3.1%)
-0.7% ( -6% - 5%) 0.428
Wildcard 63.19 (1.9%) 62.83 (1.9%)
-0.6% ( -4% - 3%) 0.337
HighSpanNear 11.63 (2.8%) 11.57 (2.2%)
-0.6% ( -5% - 4%) 0.492
OrHighNotMed 516.53 (4.0%) 513.71 (4.1%)
-0.5% ( -8% - 7%) 0.670
BrowseDayOfYearTaxoFacets 4.20 (2.5%) 4.18 (2.9%)
-0.5% ( -5% - 5%) 0.541
LowPhrase 51.17 (3.0%) 50.97 (3.0%)
-0.4% ( -6% - 5%) 0.674
MedTerm 1080.72 (5.4%) 1076.53 (4.9%)
-0.4% ( -10% - 10%) 0.812
BrowseDateTaxoFacets 4.19 (2.5%) 4.18 (2.8%)
-0.4% ( -5% - 5%) 0.657
HighTerm 923.10 (3.8%) 920.08 (4.1%)
-0.3% ( -7% - 7%) 0.795
Respell 40.97 (1.8%) 40.84 (2.2%)
-0.3% ( -4% - 3%) 0.620
OrHighMed 88.93 (2.7%) 88.72 (1.7%)
-0.2% ( -4% - 4%) 0.739
HighPhrase 314.90 (2.9%) 314.29 (4.1%)
-0.2% ( -6% - 6%) 0.863
BrowseDayOfYearSSDVFacets 11.39 (5.8%) 11.38 (5.8%)
-0.1% ( -11% - 12%) 0.957
BrowseMonthSSDVFacets 12.55 (3.3%) 12.54 (3.3%)
-0.1% ( -6% - 6%) 0.924
PKLookup 146.67 (4.3%) 146.58 (3.0%)
-0.1% ( -7% - 7%) 0.960
IntNRQ 53.49 (0.7%) 53.50 (0.7%)
0.0% ( -1% - 1%) 0.935
OrNotHighHigh 480.86 (3.8%) 481.91 (3.9%)
0.2% ( -7% - 8%) 0.857
MedSpanNear 130.25 (2.5%) 130.55 (2.5%)
0.2% ( -4% - 5%) 0.775
AndHighMed 159.94 (2.4%) 160.41 (2.6%)
0.3% ( -4% - 5%) 0.704
LowTerm 1030.42 (3.7%) 1034.09 (5.3%)
0.4% ( -8% - 9%) 0.804
OrHighHigh 14.59 (2.2%) 14.65 (1.3%)
0.4% ( -3% - 4%) 0.514
Fuzzy2 51.29 (7.7%) 51.74 (9.7%)
0.9% ( -15% - 19%) 0.755
OrNotHighMed 480.49 (2.9%) 485.66 (3.9%)
1.1% ( -5% - 8%) 0.326
Fuzzy1 35.35 (7.6%) 35.84 (7.5%)
1.4% ( -12% - 17%) 0.562
{code}
So while the index size reduction is relatively small, I don't see a real
significant performance impact either. Curious if anyone else sees something
interesting in these numbers.
Finally, I'll mention that I observed pretty similar behavior when running our
internal (Amazon product search) benchmarking tools. Pretty small impact to
index size but no significant performance hit either.
> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
> Key: LUCENE-9877
> URL: https://issues.apache.org/jira/browse/LUCENE-9877
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in
> LUCENE-9850 I thought it might also be worth-while exploring the impact of
> increasing the number of allowable exceptions in PForUtil. The aim of this
> investigation is to see if we could reduce index size by allowing for more
> exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to
> encode the number of exceptions (using the remaining 3 bits of the byte used
> to also encode the number of bits-per-value, which requires 5 bits). Each
> exception used is encoded with a full byte, using a maximum of 3 bytes per
> block.
> It seems to me like 14 might be a more ideal number of exceptions if index
> size is the driving motivation. My thought process is that, in the
> worst-case, 14 exceptions would be used to save only a single bit-per-value
> in the corresponding block. With 128 entries per block, this would save 16
> bytes. To encode the exception count, we'd need to use an extra byte per
> block since we need more than the current 3 bits. So with that extra byte in
> addition to the 14 bytes used to encode the exception values, we would save a
> single byte in total (just slightly better than breaking even). If we need
> fewer than the 14 exceptions, or if we're able to save more than 1
> bit-per-value, it's all additional savings.
> As a first experiment, I think testing 7 exceptions might be a good
> middle-ground. Seven exceptions is the most we can encode with the 3 bits we
> have allocated today. So for 7 exceptions, we don't need the extra byte for
> encoding the exception count, and in the worst case of needing all 7 to save
> 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the
> question is what kind of performance hit we might observe due to decoding
> more exceptions.
> I'll post some results here for discussion or at least for public record of
> my work for future reference.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]