[ 
https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309021#comment-17309021
 ] 

Greg Miller commented on LUCENE-9877:
-------------------------------------

Here are my results from comparing 7 allowable exceptions in PFOR to the 
baseline 3:

Similar to my approach in 
[LUCENE-9850|https://issues.apache.org/jira/browse/LUCENE-9850], I threw 
together a tool that visualizes the bits-per-value used across all PFOR encoded 
blocks. I also added in a visualization of the exception count used across all 
PFOR blocks. Here's what I found by building the Wikipedia EN index with the 
"wikimedium10m" source in luceneutil (all zero-valued histogram entries were 
truncated for readability):


{code:java}
BASELINE (3 exceptions)

PFOR BPV Histogram
 0 ***                                                 [4.02 %] (239453 of 
5959788)
 1 **                                                  [2.78 %] (165409 of 
5959788)
 2 **************************                          [51.98 %] (3097737 of 
5959788)
 3 *****************                                   [32.61 %] (1943685 of 
5959788)
 4 ****                                                [7.06 %] (420552 of 
5959788)
 5 *                                                   [1.49 %] (88700 of 
5959788)
 6 *                                                   [0.07 %] (4097 of 
5959788)
 7 *                                                   [0.00 %] (153 of 5959788)
 8 *                                                   [0.00 %] (2 of 5959788)
Total bytes used: 234757569
PFOR Exceptions Used Histogram
 0 **************************                          [51.83 %] (3088921 of 
5959788)
 1 ************                                        [22.78 %] (1357732 of 
5959788)
 2 ********                                            [14.63 %] (872061 of 
5959788)
 3 ******                                              [10.76 %] (641074 of 
5959788)

CANDIDATE (7 exceptions)

PFOR BPV Histogram
 0 ***                                                 [4.02 %] (239583 of 
5959788)
 1 *****                                               [8.75 %] (521487 of 
5959788)
 2 ********************************                    [62.40 %] (3718819 of 
5959788)
 3 ***********                                         [20.26 %] (1207623 of 
5959788)
 4 **                                                  [3.71 %] (221391 of 
5959788)
 5 *                                                   [0.82 %] (48939 of 
5959788)
 6 *                                                   [0.03 %] (1868 of 
5959788)
 7 *                                                   [0.00 %] (76 of 5959788)
 8 *                                                   [0.00 %] (2 of 5959788)
Total bytes used: 216891765
PFOR Exceptions Used Histogram
 0 ****************                                    [30.93 %] (1843145 of 
5959788)
 1 **********                                          [18.46 %] (1100315 of 
5959788)
 2 *******                                             [13.38 %] (797217 of 
5959788)
 3 ******                                              [10.34 %] (615991 of 
5959788)
 4 *****                                               [8.36 %] (498042 of 
5959788)
 5 ****                                                [7.03 %] (419005 of 
5959788)
 6 ****                                                [6.11 %] (364388 of 
5959788)
 7 ***                                                 [5.40 %] (321685 of 
5959788)
{code}

So we can observe the additional exceptions getting utilized and the 
bits-per-value generally shifting lower. This seems to amount to a ~7.6% 
decrease in space used for all of the PFOR blocks in aggregate.

Looking at the overall index size though, I only see a ~0.5% decrease in the 
index size. :

{code:java}
3816536 
wikimedium10m.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
3795532 
wikimedium10m.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
{code}

I suspect the PFOR data in this index (frequencies, positions, offsets and 
payloads) just aren't a big enough portion of the overall index to move the 
needle much in overall size. If my tool is accurate, it looks like only ~0.2 GB 
of the ~3.8 GB index is PFOR data.

As for performance implications, running a luceneutil benchmark (using "-source 
wikimedium10m") shows the following:

{code:java}
                    TaskQPS baseline      StdDevQPS pfor 7 exceptions      
StdDev                Pct diff p-value
   HighTermDayOfYearSort      139.37     (10.8%)      135.30     (10.2%)   
-2.9% ( -21% -   20%) 0.379
       HighTermMonthSort      134.56     (13.4%)      130.65     (11.9%)   
-2.9% ( -24% -   25%) 0.470
              TermDTSort      100.66     (12.9%)       97.96     (13.5%)   
-2.7% ( -25% -   27%) 0.521
    HighTermTitleBDVSort      130.20      (9.6%)      127.06      (8.4%)   
-2.4% ( -18% -   17%) 0.398
         MedSloppyPhrase       45.83      (3.7%)       44.82      (3.8%)   
-2.2% (  -9% -    5%) 0.063
           OrHighNotHigh      505.30      (3.7%)      498.15      (4.6%)   
-1.4% (  -9% -    7%) 0.288
               OrHighLow      413.80      (5.0%)      408.04      (4.9%)   
-1.4% ( -10% -    8%) 0.371
              AndHighLow      575.78      (3.5%)      568.21      (3.6%)   
-1.3% (  -8% -    5%) 0.241
               MedPhrase       19.20      (2.5%)       18.98      (2.4%)   
-1.1% (  -5% -    3%) 0.138
            OrHighNotLow      504.91      (4.7%)      499.19      (4.2%)   
-1.1% (  -9% -    8%) 0.423
         LowSloppyPhrase       44.42      (4.0%)       44.01      (3.1%)   
-0.9% (  -7% -    6%) 0.413
            OrNotHighLow      789.01      (3.7%)      781.98      (4.4%)   
-0.9% (  -8% -    7%) 0.488
             LowSpanNear      100.30      (2.1%)       99.42      (2.0%)   
-0.9% (  -4% -    3%) 0.178
    HighIntervalsOrdered       13.86      (2.0%)       13.75      (1.9%)   
-0.8% (  -4% -    3%) 0.174
                 Prefix3      230.15      (2.8%)      228.24      (3.3%)   
-0.8% (  -6% -    5%) 0.394
        HighSloppyPhrase        1.77      (3.8%)        1.76      (3.4%)   
-0.8% (  -7% -    6%) 0.477
   BrowseMonthTaxoFacets        5.01      (2.2%)        4.97      (3.5%)   
-0.8% (  -6% -    4%) 0.396
             AndHighHigh       35.38      (2.8%)       35.12      (3.1%)   
-0.7% (  -6% -    5%) 0.428
                Wildcard       63.19      (1.9%)       62.83      (1.9%)   
-0.6% (  -4% -    3%) 0.337
            HighSpanNear       11.63      (2.8%)       11.57      (2.2%)   
-0.6% (  -5% -    4%) 0.492
            OrHighNotMed      516.53      (4.0%)      513.71      (4.1%)   
-0.5% (  -8% -    7%) 0.670
BrowseDayOfYearTaxoFacets        4.20      (2.5%)        4.18      (2.9%)   
-0.5% (  -5% -    5%) 0.541
               LowPhrase       51.17      (3.0%)       50.97      (3.0%)   
-0.4% (  -6% -    5%) 0.674
                 MedTerm     1080.72      (5.4%)     1076.53      (4.9%)   
-0.4% ( -10% -   10%) 0.812
    BrowseDateTaxoFacets        4.19      (2.5%)        4.18      (2.8%)   
-0.4% (  -5% -    5%) 0.657
                HighTerm      923.10      (3.8%)      920.08      (4.1%)   
-0.3% (  -7% -    7%) 0.795
                 Respell       40.97      (1.8%)       40.84      (2.2%)   
-0.3% (  -4% -    3%) 0.620
               OrHighMed       88.93      (2.7%)       88.72      (1.7%)   
-0.2% (  -4% -    4%) 0.739
              HighPhrase      314.90      (2.9%)      314.29      (4.1%)   
-0.2% (  -6% -    6%) 0.863
BrowseDayOfYearSSDVFacets       11.39      (5.8%)       11.38      (5.8%)   
-0.1% ( -11% -   12%) 0.957
   BrowseMonthSSDVFacets       12.55      (3.3%)       12.54      (3.3%)   
-0.1% (  -6% -    6%) 0.924
                PKLookup      146.67      (4.3%)      146.58      (3.0%)   
-0.1% (  -7% -    7%) 0.960
                  IntNRQ       53.49      (0.7%)       53.50      (0.7%)    
0.0% (  -1% -    1%) 0.935
           OrNotHighHigh      480.86      (3.8%)      481.91      (3.9%)    
0.2% (  -7% -    8%) 0.857
             MedSpanNear      130.25      (2.5%)      130.55      (2.5%)    
0.2% (  -4% -    5%) 0.775
              AndHighMed      159.94      (2.4%)      160.41      (2.6%)    
0.3% (  -4% -    5%) 0.704
                 LowTerm     1030.42      (3.7%)     1034.09      (5.3%)    
0.4% (  -8% -    9%) 0.804
              OrHighHigh       14.59      (2.2%)       14.65      (1.3%)    
0.4% (  -3% -    4%) 0.514
                  Fuzzy2       51.29      (7.7%)       51.74      (9.7%)    
0.9% ( -15% -   19%) 0.755
            OrNotHighMed      480.49      (2.9%)      485.66      (3.9%)    
1.1% (  -5% -    8%) 0.326
                  Fuzzy1       35.35      (7.6%)       35.84      (7.5%)    
1.4% ( -12% -   17%) 0.562
{code}

So while the index size reduction is relatively small, I don't see a real 
significant performance impact either. Curious if anyone else sees something 
interesting in these numbers.

Finally, I'll mention that I observed pretty similar behavior when running our 
internal (Amazon product search) benchmarking tools. Pretty small impact to 
index size but no significant performance hit either.

> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-9877
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9877
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in 
> LUCENE-9850 I thought it might also be worth-while exploring the impact of 
> increasing the number of allowable exceptions in PForUtil. The aim of this 
> investigation is to see if we could reduce index size by allowing for more 
> exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to 
> encode the number of exceptions (using the remaining 3 bits of the byte used 
> to also encode the number of bits-per-value, which requires 5 bits). Each 
> exception used is encoded with a full byte, using a maximum of 3 bytes per 
> block.
> It seems to me like 14 might be a more ideal number of exceptions if index 
> size is the driving motivation. My thought process is that, in the 
> worst-case, 14 exceptions would be used to save only a single bit-per-value 
> in the corresponding block. With 128 entries per block, this would save 16 
> bytes. To encode the exception count, we'd need to use an extra byte per 
> block since we need more than the current 3 bits. So with that extra byte in 
> addition to the 14 bytes used to encode the exception values, we would save a 
> single byte in total (just slightly better than breaking even). If we need 
> fewer than the 14 exceptions, or if we're able to save more than 1 
> bit-per-value, it's all additional savings.
> As a first experiment, I think testing 7 exceptions might be a good 
> middle-ground. Seven exceptions is the most we can encode with the 3 bits we 
> have allocated today. So for 7 exceptions, we don't need the extra byte for 
> encoding the exception count, and in the worst case of needing all 7 to save 
> 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the 
> question is what kind of performance hit we might observe due to decoding 
> more exceptions.
> I'll post some results here for discussion or at least for public record of 
> my work for future reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to