I went ahead and created a Jira issue to track this investigation: https://issues.apache.org/jira/browse/LUCENE-9850. I thought a Jira issue might help us better track my findings as I work on this. I hope this was appropriate, but if not, please let me know. Thanks again!
Cheers,
-Greg

On Tue, Mar 16, 2021 at 12:51 PM Greg Miller <[email protected]> wrote:
>
> Hi Adrien-
>
> Thanks for all the additional context and the benchmark link!
>
> > The reason is that these postings of 3-grams are very dense, and
> > saving one bit per value saves a lot more space relatively when
> > blocks have 5 bits per value than when they have e.g. 15 bits per
> > value.
>
> That's an interesting observation. I'm working on gathering this same
> bpv stat in our application right now, so this is useful to keep in
> mind.
>
> It seems like it would be valuable to measure the performance impact
> of moving to PFOR for doc IDs, in contrast to what it might save in
> index size. While I'm able to do this on our internal system with our
> own benchmarks, that's a very specific use case and may not translate
> well to a more general upstream solution. Do you have any guidance on
> how this type of evaluation is generally done in the context of an
> upstream change? As a first step, I can open a Jira issue to track the
> evaluation if you think that would be useful. Thanks again!
>
> Cheers,
> -Greg
>
>
> On Tue, Mar 16, 2021 at 2:11 AM Adrien Grand <[email protected]> wrote:
> >
> > Hi Greg,
> >
> > You guessed b) right. I had done some microbenchmarks suggesting
> > that computing the prefix sum as part of decoding was an important
> > optimization performance-wise, compared to unpacking and then
> > computing the prefix sum separately.
> >
> > The other factor that played a role for me is that frequencies and
> > positions are usually less performance-sensitive than doc IDs. For
> > instance, conjunctions only decode frequencies after reaching
> > agreement across all clauses, and phrase queries only decode
> > positions if term frequencies suggest that the current document
> > could have a competitive score.
> >
> > I would be interested in discussing using PFOR for everything. At
> > Elastic we've been looking into building 3-gram indexes to perform
> > infix search, and noticed that we would save a lot of space if we
> > moved to PFOR rather than FOR for doc ID deltas. The reason is that
> > these postings of 3-grams are very dense, and saving one bit per
> > value saves a lot more space relatively when blocks have 5 bits per
> > value than when they have e.g. 15 bits per value.
> >
> > On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> wrote:
> >>
> >> Hi folks-
> >>
> >> I'm curious to understand the history/context of using PFOR for
> >> positions and frequencies while continuing to use basic FOR for
> >> docid encoding. I've done my best to turn up any past conversations
> >> on this, but wasn't able to find much. Apologies if I missed it in
> >> my digging! From what I've gathered, the basic FOR encoding was
> >> introduced to Lucene with LUCENE-3892 (which was a continuation of
> >> LUCENE-1410). While PFOR had been discussed plenty in the earlier
> >> issues, I gather that it wasn't actually committed until
> >> LUCENE-9027. Hopefully I've got that much right. And it appears at
> >> that time to have been introduced for positions and frequencies,
> >> but not docids.
> >>
> >> Is the reasoning here that, a) since docids are delta-encoded
> >> already, outliers/exceptions will be less likely/beneficial, and
> >> b) FOR allows for an optimization in decoding the deltas (via
> >> ForUtil#decodeAndPrefixSum) which can't be utilized with PFOR,
> >> since the exceptions must be patched in before decoding the
> >> deltas? Are there other reasons FOR continues to be used for
> >> docids that I'm overlooking?
> >>
> >> I'm curious because I recently ran some internal benchmarks on the
> >> Amazon product search engine replacing FOR with PFOR for docid
> >> delta encoding, and saw an index size reduction of 0.93% while
> >> also improving our red-line queries/sec by 1.0%. I expected the
> >> index size reduction but wasn't expecting to see a QPS
> >> improvement, which I haven't yet been able to explain. I'm
> >> wondering if there are good reasons to keep using FOR for docids,
> >> or if there'd be any appetite to discuss using PFOR for
> >> everything? Again, apologies if I've overlooked some past
> >> discussion in my digging. Any history/context is much appreciated!
> >>
> >> Cheers,
> >> -Greg
> >
> >
> > --
> > Adrien
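[Editor's note: Adrien's point about relative savings can be made concrete with a back-of-the-envelope sketch. This is illustrative only, not Lucene's actual ForUtil/PForUtil format; the block size of 128 matches Lucene's postings blocks, but the 16-bits-per-exception cost and the one-bit bpv reduction are assumed constants for the sake of the arithmetic.]

```java
// Rough size comparison of one 128-value postings block under plain FOR
// versus a patched PFOR that packs at one fewer bit per value and stores
// the outliers as exceptions. NOT Lucene's real encoding; the exception
// cost (~16 bits each: patch bits plus a position byte) is an assumption.
public class PforSizeSketch {
    static final int BLOCK_SIZE = 128;

    // FOR: every value packed at the block's max bits-per-value (bpv).
    static int forBits(int bpv) {
        return BLOCK_SIZE * bpv;
    }

    // PFOR: pack at (bpv - 1), then patch `exceptions` outliers.
    static int pforBits(int bpv, int exceptions, int bitsPerException) {
        return BLOCK_SIZE * (bpv - 1) + exceptions * bitsPerException;
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {5, 15}) {
            int plain = forBits(bpv);
            int patched = pforBits(bpv, 4, 16);
            double savedPct = 100.0 * (plain - patched) / plain;
            System.out.printf("bpv=%2d  FOR=%4d bits  PFOR=%4d bits  saved=%.1f%%%n",
                    bpv, plain, patched, savedPct);
        }
    }
}
```

Under these assumptions a dense block (5 bpv) shrinks from 640 to 576 bits, a 10% saving, while a sparse block (15 bpv) shrinks from 1920 to 1856 bits, only about 3.3%, which is the asymmetry Adrien describes for very dense 3-gram postings.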
