Hi Adrien-

Thanks for all the additional context and the benchmark link!

> The reason is that these postings of 3-grams are very dense, and saving one 
> bit per value saves a lot more space relatively when blocks have 5 bits per 
> value than when they have e.g. 15 bits per value.

That's an interesting observation. I'm working on gathering this same
bpv stat in our application right now, so this is useful to keep in
mind.
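
To make the relative-savings arithmetic concrete, here's a toy Python calculation (illustrative only, not Lucene code; `block_bytes` is a made-up helper, and BLOCK_SIZE mirrors the 128-value blocks the postings format uses):

```python
# Toy illustration (not Lucene code): why saving one bit per value
# matters more for dense (low bits-per-value) blocks.
BLOCK_SIZE = 128  # values per bit-packed block, as in Lucene's postings format

def block_bytes(bits_per_value: int) -> float:
    """Size in bytes of one bit-packed block of BLOCK_SIZE values."""
    return BLOCK_SIZE * bits_per_value / 8

for bpv in (5, 15):
    before = block_bytes(bpv)
    after = block_bytes(bpv - 1)
    saving = (before - after) / before
    print(f"{bpv} bpv -> {bpv - 1} bpv: {saving:.0%} smaller")
```

On a 128-value block that works out to a 20% saving when going from 5 to 4 bits per value, versus roughly 7% when going from 15 to 14.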

It seems like it would be valuable to measure the performance impact
of moving to PFOR for doc IDs in contrast to what it might save in
index size. While I'm able to do this on our internal system with our
own benchmarks, that's a very specific use case and may not translate
well to a more general upstream solution. Do you have any guidance on
how this type of evaluation is generally done in the context of an
upstream change? As a first step, I can open a Jira issue to track the
evaluation if you think that would be useful. Thanks again!

Cheers,
-Greg
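
P.S. For anyone skimming the thread, here's a rough plain-Python sketch of the decoding difference discussed below (the function names and the OR-based patching scheme are my own simplification, not Lucene's actual ForUtil/PForUtil code):

```python
from itertools import accumulate

# Sketch only: with FOR, every delta in a block is stored at the same
# width, so unpacking and the prefix sum can be fused into one pass.
def for_decode(deltas, base=0):
    """FOR: uniform-width deltas; prefix-sum directly while decoding."""
    return list(accumulate(deltas, initial=base))[1:]

# With PFOR, most deltas are stored at a narrower width and a few
# outliers carry an exception list; the exceptions must be patched back
# in *before* the prefix sum, which breaks the fused single-pass decode.
def pfor_decode(low_bits, exceptions, base=0):
    """PFOR: patch exceptions into the narrow deltas, then prefix-sum."""
    deltas = list(low_bits)
    for pos, high_bits in exceptions:  # patch pass over the exceptions
        deltas[pos] |= high_bits       # restore the outlier's upper bits
    return list(accumulate(deltas, initial=base))[1:]

# Example: deltas [3, 1, 200, 2]. The outlier 200 needs 8 bits while the
# rest fit in 2, so PFOR keeps the low 2 bits plus one patched exception.
assert for_decode([3, 1, 200, 2]) == pfor_decode([3, 1, 0, 2], [(2, 200)])
```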


On Tue, Mar 16, 2021 at 2:11 AM Adrien Grand <[email protected]> wrote:
>
> Hi Greg,
>
> You guessed b) right. I had done some microbenchmarks that suggested that the 
> optimization of the prefix sum was important performance-wise compared to 
> unpacking and then doing the prefix sum.
>
> The other factor that played a role for me is that frequencies and positions 
> are usually less performance-sensitive than doc IDs. For instance 
> conjunctions only decode frequencies after reaching agreement across all 
> clauses, and phrase queries only decode positions if term frequencies suggest 
> that the current document could have a competitive score.
>
> I would be interested in discussing using PFOR for everything. At Elastic 
> we've been looking into building 3-gram indexes to perform infix search, and 
> noticed that we would save a lot of space if we moved to PFOR rather than FOR 
> for doc ID deltas. The reason is that these postings of 3-grams are very 
> dense, and saving one bit per value saves a lot more space relatively when 
> blocks have 5 bits per value than when they have e.g. 15 bits per value.
>
> On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> wrote:
>>
>> Hi folks-
>>
>> I'm curious to understand the history/context of using PFOR for positions 
>> and frequencies while continuing to use basic FOR for docid encoding. I've 
>> done my best to turn up any past conversations on this, but wasn't able to 
>> find much. Apologies if I missed it in my digging! From what I've gathered, 
>> the basic FOR encoding was introduced to Lucene with LUCENE-3892 (which was 
>> a continuation of LUCENE-1410). While PFOR had been discussed plenty in the 
>> earlier issues, I gather that it wasn't actually committed until 
>> LUCENE-9027. Hopefully I've got that much right. And it appears at that time 
>> to have been introduced for positions and frequencies, but not docids.
>>
>> Is the reasoning here that, a) since docids are delta-encoded already, 
>> outliers/exceptions will be less likely/beneficial, and b) FOR allows for an 
>> optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum), which 
>> can't be utilized with PFOR, since the exceptions must be patched in before 
>> decoding deltas? Are the other reasons FOR continues to be used for docids 
>> that I'm overlooking?
>>
>> I'm curious as I recently ran some internal benchmarks on the Amazon product 
>> search engine replacing FOR with PFOR for docids delta encoding, and saw an 
>> index size reduction of 0.93% while also improving our red-line queries/sec 
>> by +1.0%. I expected the index size reduction but wasn't expecting to see a 
>> QPS improvement, which I haven't yet been able to explain. I'm wondering if 
>> there are some good reasons to keep using FOR for docids, or if there'd be 
>> any appetite to discuss using PFOR for everything? Again, apologies if I've 
>> overlooked some past discussion in my digging. Any history/context is much 
>> appreciated!
>>
>> Cheers,
>> -Greg
>
>
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
