Hi Greg,

You guessed b) right. I had done some microbenchmarks
<https://github.com/jpountz/decode-128-ints-benchmark> that suggested that
fusing delta decoding with the prefix sum was an important performance win
compared to unpacking first and then computing the prefix sum.
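To make the fusion argument concrete, here is a simplified sketch (not the actual ForUtil code; class and method names are made up for illustration). With plain FOR, the unpacked deltas can be turned into absolute doc IDs in the same pass; with PFOR, every exception must be patched back into the block first, because each prefix-sum output depends on all earlier deltas:

```java
import java.util.Arrays;

// Illustrative sketch only, not Lucene's ForUtil/PForUtil implementation.
public class PrefixSumSketch {

  // FOR-style: deltas are complete as soon as they are unpacked, so the
  // prefix sum (delta -> absolute doc ID) can run immediately, or even be
  // fused into the unpacking loop itself.
  static void prefixSum(int[] deltas, int base) {
    deltas[0] += base;
    for (int i = 1; i < deltas.length; i++) {
      deltas[i] += deltas[i - 1];
    }
  }

  // PFOR-style: the high bits of outlier values live in an exception list
  // and must be patched in before the prefix sum, since a single wrong
  // delta would corrupt every subsequent doc ID.
  static void patchThenPrefixSum(
      int[] deltas, int[] exceptionIndexes, int[] exceptionHighBits, int base) {
    for (int i = 0; i < exceptionIndexes.length; i++) {
      deltas[exceptionIndexes[i]] |= exceptionHighBits[i];
    }
    prefixSum(deltas, base);
  }

  public static void main(String[] args) {
    int[] deltas = {3, 1, 4, 1, 5};
    prefixSum(deltas, 100);
    System.out.println(Arrays.toString(deltas)); // prints [103, 104, 108, 109, 114]
  }
}
```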
The other factor that played a role for me is that frequencies and positions
are usually less performance-sensitive than doc IDs. For instance,
conjunctions only decode frequencies after reaching agreement across all
clauses, and phrase queries only decode positions if term frequencies
suggest that the current document could have a competitive score.

I would be interested in discussing using PFOR for everything. At Elastic
we've been looking into building 3-gram indexes to perform infix search,
and noticed that we would save a lot of space if we moved from FOR to PFOR
for doc ID deltas. The reason is that the postings of 3-grams are very
dense, and saving one bit per value saves a lot more space relatively when
blocks have 5 bits per value than when they have e.g. 15 bits per value.

On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> wrote:

> Hi folks-
>
> I'm curious to understand the history/context of using PFOR for positions
> and frequencies while continuing to use basic FOR for docid encoding. I've
> done my best to turn up any past conversations on this, but wasn't able to
> find much. Apologies if I missed it in my digging! From what I've gathered,
> the basic FOR encoding was introduced to Lucene with LUCENE-3892
> <https://issues.apache.org/jira/browse/LUCENE-3892> (which was a
> continuation of LUCENE-1410
> <https://issues.apache.org/jira/browse/LUCENE-1410>). While PFOR had been
> discussed plenty in the earlier issues, I gather that it wasn't actually
> committed until LUCENE-9027
> <https://issues.apache.org/jira/browse/LUCENE-9027>. Hopefully I've got
> that much right. And it appears at that time to have been introduced for
> positions and frequencies, but not docids.
>
> Is the reasoning here that a) since docids are delta-encoded already,
> outliers/exceptions will be less likely/beneficial, and b) FOR allows for
> an optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum)
> which can't be utilized with PFOR, since the exceptions must be patched in
> before decoding deltas? Are there other reasons FOR continues to be used
> for docids that I'm overlooking?
>
> I'm curious because I recently ran some internal benchmarks on the Amazon
> product search engine, replacing FOR with PFOR for docid delta encoding,
> and saw an index size reduction of 0.93% while also improving our red-line
> queries/sec by 1.0%. I expected the index size reduction but wasn't
> expecting to see a QPS improvement, which I haven't yet been able to
> explain. I'm wondering if there are good reasons to keep using FOR for
> docids, or if there'd be any appetite to discuss using PFOR for everything?
> Again, apologies if I've overlooked some past discussion in my digging.
> Any history/context is much appreciated!
>
> Cheers,
> -Greg

--
Adrien
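The density argument above (5 vs. 15 bits per value) comes down to quick arithmetic: saving one bit per value is worth 1/5 = 20% of a 5-bits-per-value block, but only 1/15 ≈ 6.7% of a 15-bits-per-value one. A tiny sketch of that arithmetic (the 128-value block size mirrors Lucene's postings blocks; the bits-per-value figures are illustrative):

```java
// Relative space saved by shaving one bit per value off a packed block:
// dense postings (low bits per value) benefit far more, relatively.
public class RelativeSavings {
  public static void main(String[] args) {
    int blockSize = 128; // Lucene packs postings in blocks of 128 values
    for (int bitsPerValue : new int[] {5, 15}) {
      int blockBits = blockSize * bitsPerValue;
      // One bit saved per value frees blockSize bits out of blockBits.
      double savedPct = 100.0 * blockSize / blockBits;
      System.out.printf("%2d bits/value: block = %4d bits, saving = %.1f%%%n",
          bitsPerValue, blockBits, savedPct);
    }
  }
}
```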
