Hi Greg,

You guessed b) right. I had done some microbenchmarks
<https://github.com/jpountz/decode-128-ints-benchmark> that suggested that
fusing delta decoding with the prefix sum was an important performance win
compared to unpacking first and then computing the prefix sum.
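To make the fusion argument concrete, here is a simplified sketch (not the actual ForUtil code; class and method names are made up for illustration). With plain FOR, the unpacked deltas can be turned into absolute doc IDs in the same pass; with PFOR, every exception must be patched back into the block first, because each prefix-sum output depends on all earlier deltas:

```java
import java.util.Arrays;

// Illustrative sketch only, not Lucene's ForUtil/PForUtil implementation.
public class PrefixSumSketch {

  // FOR-style: deltas are complete as soon as they are unpacked, so the
  // prefix sum (delta -> absolute doc ID) can run immediately, or even be
  // fused into the unpacking loop itself.
  static void prefixSum(int[] deltas, int base) {
    deltas[0] += base;
    for (int i = 1; i < deltas.length; i++) {
      deltas[i] += deltas[i - 1];
    }
  }

  // PFOR-style: the high bits of outlier values live in an exception list
  // and must be patched in before the prefix sum, since a single wrong
  // delta would corrupt every subsequent doc ID.
  static void patchThenPrefixSum(
      int[] deltas, int[] exceptionIndexes, int[] exceptionHighBits, int base) {
    for (int i = 0; i < exceptionIndexes.length; i++) {
      deltas[exceptionIndexes[i]] |= exceptionHighBits[i];
    }
    prefixSum(deltas, base);
  }

  public static void main(String[] args) {
    int[] deltas = {3, 1, 4, 1, 5};
    prefixSum(deltas, 100);
    System.out.println(Arrays.toString(deltas)); // prints [103, 104, 108, 109, 114]
  }
}
```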
The other factor that played a role for me is that frequencies and positions
are usually less performance-sensitive than doc IDs. For instance,
conjunctions only decode frequencies after reaching agreement across all
clauses, and phrase queries only decode positions if term frequencies
suggest that the current document could have a competitive score.

I would be interested in discussing using PFOR for everything. At Elastic
we've been looking into building 3-gram indexes to perform infix search,
and noticed that we would save a lot of space if we moved from FOR to PFOR
for doc ID deltas. The reason is that the postings of 3-grams are very
dense, and saving one bit per value saves a lot more space relatively when
blocks have 5 bits per value than when they have e.g. 15 bits per value.

On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> wrote:

> Hi folks-
>
> I'm curious to understand the history/context of using PFOR for positions
> and frequencies while continuing to use basic FOR for docid encoding. I've
> done my best to turn up any past conversations on this, but wasn't able to
> find much. Apologies if I missed it in my digging! From what I've gathered,
> the basic FOR encoding was introduced to Lucene with LUCENE-3892
> <https://issues.apache.org/jira/browse/LUCENE-3892> (which was a
> continuation of LUCENE-1410
> <https://issues.apache.org/jira/browse/LUCENE-1410>). While PFOR had been
> discussed plenty in the earlier issues, I gather that it wasn't actually
> committed until LUCENE-9027
> <https://issues.apache.org/jira/browse/LUCENE-9027>. Hopefully I've got
> that much right. And it appears at that time to have been introduced for
> positions and frequencies, but not docids.
>
> Is the reasoning here that a) since docids are delta-encoded already,
> outliers/exceptions will be less likely/beneficial, and b) FOR allows for
> an optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum)
> which can't be utilized with PFOR, since the exceptions must be patched in
> before decoding deltas? Are there other reasons FOR continues to be used
> for docids that I'm overlooking?
>
> I'm curious because I recently ran some internal benchmarks on the Amazon
> product search engine, replacing FOR with PFOR for docid delta encoding,
> and saw an index size reduction of 0.93% while also improving our red-line
> queries/sec by 1.0%. I expected the index size reduction but wasn't
> expecting to see a QPS improvement, which I haven't yet been able to
> explain. I'm wondering if there are good reasons to keep using FOR for
> docids, or if there'd be any appetite to discuss using PFOR for everything?
> Again, apologies if I've overlooked some past discussion in my digging.
> Any history/context is much appreciated!
>
> Cheers,
> -Greg

--
Adrien
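The density argument above (5 vs. 15 bits per value) comes down to quick arithmetic: saving one bit per value is worth 1/5 = 20% of a 5-bits-per-value block, but only 1/15 ≈ 6.7% of a 15-bits-per-value one. A tiny sketch of that arithmetic (the 128-value block size mirrors Lucene's postings blocks; the bits-per-value figures are illustrative):

```java
// Relative space saved by shaving one bit per value off a packed block:
// dense postings (low bits per value) benefit far more, relatively.
public class RelativeSavings {
  public static void main(String[] args) {
    int blockSize = 128; // Lucene packs postings in blocks of 128 values
    for (int bitsPerValue : new int[] {5, 15}) {
      int blockBits = blockSize * bitsPerValue;
      // One bit saved per value frees blockSize bits out of blockBits.
      double savedPct = 100.0 * blockSize / blockBits;
      System.out.printf("%2d bits/value: block = %4d bits, saving = %.1f%%%n",
          bitsPerValue, blockBits, savedPct);
    }
  }
}
```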
