Hi folks,

I'm curious to understand the history/context of using PFOR for positions and frequencies while continuing to use basic FOR for docid encoding. I've done my best to turn up any past conversations on this, but wasn't able to find much. Apologies if I missed it in my digging!

From what I've gathered, the basic FOR encoding was introduced to Lucene with LUCENE-3892 <https://issues.apache.org/jira/browse/LUCENE-3892> (which was a continuation of LUCENE-1410 <https://issues.apache.org/jira/browse/LUCENE-1410>). While PFOR had been discussed plenty in the earlier issues, I gather that it wasn't actually committed until LUCENE-9027 <https://issues.apache.org/jira/browse/LUCENE-9027>. Hopefully I've got that much right. And it appears at that time to have been introduced for positions and frequencies, but not docids.
Is the reasoning here that, a) since docids are delta-encoded already, outliers/exceptions will be less likely/beneficial, and b) FOR allows for an optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum) which can't be utilized with PFOR, since the exceptions must be patched in before decoding deltas? Are there other reasons FOR continues to be used for docids that I'm overlooking?

I'm curious as I recently ran some internal benchmarks on the Amazon product search engine, replacing FOR with PFOR for docid delta encoding, and saw an index size reduction of 0.93% while also improving our red-line queries/sec by 1.0%. I expected the index size reduction but wasn't expecting to see a QPS improvement, which I haven't yet been able to explain.

I'm wondering if there are some good reasons to keep using FOR for docids, or if there'd be any appetite to discuss using PFOR for everything? Again, apologies if I've overlooked some past discussion in my digging. Any history/context is much appreciated!

Cheers,
-Greg
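
P.S. To make points (a) and (b) a bit more concrete, here is a rough toy sketch in Java. It is not Lucene's actual ForUtil/PForUtil code: the class name, the helper methods, and the exception cost model (one byte for the exception's position plus its overflow bits) are all assumptions made up for illustration. It compares the bit cost of FOR and a simplified PFOR on one small block of docid deltas, and shows that exceptions have to be patched back in before the prefix sum can rebuild the docids.

```java
import java.util.Arrays;

// Toy model only: NOT Lucene's ForUtil/PForUtil. All names and the
// exception cost model below are hypothetical.
public class ForVsPforSketch {

    // Bits needed to represent a non-negative value (0 for v == 0).
    static int bitsRequired(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    // Width of the widest value in the block.
    static int maxBits(int[] values) {
        int max = 0;
        for (int v : values) max = Math.max(max, bitsRequired(v));
        return max;
    }

    // Sorted docids -> gaps between consecutive docids.
    static int[] toDeltas(int[] docIds) {
        int[] out = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            out[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return out;
    }

    // Gaps -> docids (the step a fused decode-and-prefix-sum would perform).
    static int[] prefixSum(int[] deltas) {
        int[] out = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            out[i] = sum;
        }
        return out;
    }

    // FOR: every value is stored at the width of the largest value.
    static int forCostBits(int[] values) {
        return maxBits(values) * values.length;
    }

    // Simplified PFOR: try each narrower width b; values that do not fit in
    // b bits become exceptions. Assumed cost per exception: 8 bits for its
    // position plus (maxBits - b) overflow bits.
    static int pforCostBits(int[] values) {
        int max = maxBits(values);
        int best = forCostBits(values);
        for (int b = 0; b < max; b++) {
            int exceptions = 0;
            for (int v : values) {
                if (bitsRequired(v) > b) exceptions++;
            }
            int cost = b * values.length + exceptions * (8 + max - b);
            best = Math.min(best, cost);
        }
        return best;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 5, 6, 9, 10, 1000, 1002, 1003};
        int[] deltas = toDeltas(docIds); // 3, 2, 1, 3, 1, 990, 2, 1

        System.out.println("FOR bits:  " + forCostBits(deltas));  // 80
        System.out.println("PFOR bits: " + pforCostBits(deltas)); // 32

        // PFOR-style decode with b = 2: low bits first, then patch the one
        // exception (index 5), and only then run the prefix sum.
        int b = 2;
        int[] decoded = new int[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            decoded[i] = deltas[i] & ((1 << b) - 1);
        }
        decoded[5] |= (deltas[5] >>> b) << b; // patch in the high bits
        System.out.println(Arrays.equals(prefixSum(decoded), docIds)); // true
    }
}
```

In this toy block a single large gap forces FOR to 10 bits per value, while the simplified PFOR stores everything at 2 bits plus one exception. Whether that carries over to real postings depends on the actual gap distribution and on how exceptions are really encoded, which this sketch does not attempt to model.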
