[
https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309752#comment-17309752
]
Greg Miller commented on LUCENE-9877:
-------------------------------------
If anyone wants to have a closer look at my methodology (and double-check
things), here are the relevant branches:
*
[pfor-bpv-tool|https://github.com/gsmiller/lucene/tree/LUCENE-9877/pfor-bpv-tool]:
The tool lives in o.a.l.codecs.BitsPerValueTool.java and the index path needs
to be specified as the single argument when running. It can be run against an
index using PFOR encoded freqs/pos/offsets/payloads with an exception count up
to 7. It will print out the histograms you see above.
*
[0-exceptions|https://github.com/gsmiller/lucene/tree/LUCENE-9877/0-exceptions]:
Just takes away exceptions in PFOR, essentially turning it into vanilla FOR.
You can rebuild an index with this fork to force no exceptions to be used.
*
[7-exceptions|https://github.com/gsmiller/lucene/tree/LUCENE-9877/7-exceptions]:
Allows up to 7 exceptions in PFOR. You can rebuild an index with this for to
allow for more exceptions.
> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
> Key: LUCENE-9877
> URL: https://issues.apache.org/jira/browse/LUCENE-9877
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in
> LUCENE-9850 I thought it might also be worth-while exploring the impact of
> increasing the number of allowable exceptions in PForUtil. The aim of this
> investigation is to see if we could reduce index size by allowing for more
> exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to
> encode the number of exceptions (using the remaining 3 bits of the byte used
> to also encode the number of bits-per-value, which requires 5 bits). Each
> exception used is encoded with a two full bytes, using a maximum of 6 bytes
> per block.
> It seems to me like 7 might be a more ideal number of exceptions if index
> size is the driving motivation. My thought process is that, in the
> worst-case, 7 exceptions would be used to save only a single bit-per-value in
> the corresponding block. With 128 entries per block, this would save 16
> bytes. So with 14 bytes used to encode the exception values (7 x 2 bytes per
> exception), we would save a two bytes in total (just slightly better than
> breaking even). If we need fewer than the 7 exceptions, or if we're able to
> save more than 1 bit-per-value, it's all additional savings. I suppose the
> question is what kind of performance hit we might observe due to decoding
> more exceptions.
> Also note that 7 exceptions is the max we can encode with the 3 bits we
> currently have available for the number of exceptions. So moving to 8
> exceptions would not only take 16 bytes to encode the exceptions (if using
> all of them), but we'd need one more byte per block to encode the exception
> count. So in the worst case of using all 8 exceptions to save 1 bit per
> value, we'd actually be worse off.
> I'll post some results here for discussion or at least for public record of
> my work for future reference.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]