I went ahead and created a Jira issue to track this investigation:
https://issues.apache.org/jira/browse/LUCENE-9850. I thought a Jira
issue might help us better track my findings as I work on this. I hope
this was appropriate, but if not, please let me know. Thanks again!

Cheers,
-Greg

On Tue, Mar 16, 2021 at 12:51 PM Greg Miller <[email protected]> wrote:
>
> Hi Adrien-
>
> Thanks for all the additional context and the benchmark link!
>
> > The reason is that these postings of 3-grams are very dense, and saving one 
> > bit per value saves a lot more space relatively when blocks have 5 bits per 
> > value than when they have e.g. 15 bits per value.
>
> That's an interesting observation. I'm working on gathering this same
> bpv stat in our application right now, so this is useful to keep in
> mind.
>
> It seems like it would be valuable to measure the performance impact
> of moving to PFOR for doc IDs in contrast to what it might save in
> index size. While I'm able to do this on our internal system with our
> own benchmarks, that's a very specific use-case and may not translate
> well to a more general upstream solution. Do you have any guidance on
> how this type of evaluation is generally done in the context of an
> upstream change? As a first step, I can open a Jira issue to track the
> evaluation if you think that would be useful. Thanks again!
>
> Cheers,
> -Greg
>
>
> On Tue, Mar 16, 2021 at 2:11 AM Adrien Grand <[email protected]> wrote:
> >
> > Hi Greg,
> >
> > You guessed b) right. I had done some microbenchmarks that suggested that 
> > the optimization of the prefix sum was important performance-wise compared 
> > to unpacking and then doing the prefix sum.
> >
> > The other factor that played a role to me is that frequencies and positions 
> > are usually less performance-sensitive than doc IDs. For instance 
> > conjunctions only decode frequencies after reaching agreement across all 
> > clauses, and phrase queries only decode positions if term frequencies 
> > suggest that the current document could have a competitive score.
> >
> > I would be interested in discussing using PFOR for everything. At Elastic 
> > we've been looking into building 3-gram indexes to perform infix search, 
> > and noticed that we would save a lot of space if we moved to PFOR rather 
> > than FOR for doc ID deltas. The reason is that these postings of 3-grams 
> > are very dense, and saving one bit per value saves a lot more space 
> > relatively when blocks have 5 bits per value than when they have e.g. 15 
> > bits per value.
> >
> > On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> wrote:
> >>
> >> Hi folks-
> >>
> >> I'm curious to understand the history/context of using PFOR for positions 
> >> and frequencies while continuing to use basic FOR for docid encoding. I've 
> >> done my best to turn up any past conversations on this, but wasn't able to 
> >> find much. Apologies if I missed it in my digging! From what I've 
> >> gathered, the basic FOR encoding was introduced to Lucene with LUCENE-3892 
> >> (which was a continuation of LUCENE-1410). While PFOR had been discussed 
> >> plenty in the earlier issues, I gather that it wasn't actually committed 
> >> until LUCENE-9027. Hopefully I've got that much right. And it appears at 
> >> that time to have been introduced for positions and frequencies, but not 
> >> docids.
> >>
> >> Is the reasoning here that, a) since docids are delta-encoded already, 
> >> outliers/exceptions will be less likely/beneficial, and b) FOR allows for 
> >> an optimization in decoding the deltas (via. ForUtil#decodeAndPrefixSum) 
> >> which can't be utilized with PFOR, since the exceptions must be patched in 
> >> before decoding deltas? Are the other reasons FOR continues to be used for 
> >> docids that I'm overlooking?
> >>
> >> I'm curious as I recently ran some internal benchmarks on the Amazon 
> >> product search engine replacing FOR with PFOR for docids delta encoding, 
> >> and saw an index size reduction of -0.93% while also improving our 
> >> red-line queries/sec by +1.0%. I expected the index size reduction but 
> >> wasn't expecting to see a QPS improvement, which I haven't yet been able 
> >> to explain. I'm wondering if there are some good reasons to keep using FOR 
> >> for docids, or if there'd be any appetite to discuss using PFOR for 
> >> everything? Again, apologies if I've overlooked some past discussion in my 
> >> digging. Any history/context is much appreciated!
> >>
> >> Cheers,
> >> -Greg
> >
> >
> >
> > --
> > Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to