This isn't only appropriate, but also the preferred way to move such discussions forward. Thanks Greg!
On Wed, Mar 17, 2021 at 11:27 PM Greg Miller <[email protected]> wrote: > I went ahead and created a Jira issue to track this investigation: > https://issues.apache.org/jira/browse/LUCENE-9850. I thought a Jira > issue might help us better track my findings as I work on this. I hope > this was appropriate, but if not, please let me know. Thanks again! > > Cheers, > -Greg > > On Tue, Mar 16, 2021 at 12:51 PM Greg Miller <[email protected]> wrote: > > > > Hi Adrien- > > > > Thanks for all the additional context and the benchmark link! > > > > > The reason is that these postings of 3-grams are very dense, and > saving one bit per value saves a lot more space relatively when blocks have > 5 bits per value than when they have e.g. 15 bits per value. > > > > That's an interesting observation. I'm working on gathering this same > > bpv stat in our application right now, so this is useful to keep in > > mind. > > > > It seems like it would be valuable to measure the performance impact > > of moving to PFOR for doc IDs in contrast to what it might save in > > index size. While I'm able to do this on our internal system with our > > own benchmarks, that's a very specific use-case and may not translate > > well to a more general upstream solution. Do you have any guidance on > > how this type of evaluation is generally done in the context of an > > upstream change? As a first step, I can open a Jira issue to track the > > evaluation if you think that would be useful. Thanks again! > > > > Cheers, > > -Greg > > > > > > On Tue, Mar 16, 2021 at 2:11 AM Adrien Grand <[email protected]> wrote: > > > > > > Hi Greg, > > > > > > You guessed b) right. I had done some microbenchmarks that suggested > that the optimization of the prefix sum was important performance-wise > compared to unpacking and then doing the prefix sum. > > > > > > The other factor that played a role to me is that frequencies and > positions are usually less performance-sensitive than doc IDs. For instance > conjunctions only decode frequencies after reaching agreement across all > clauses, and phrase queries only decode positions if term frequencies > suggest that the current document could have a competitive score. > > > > > > I would be interested in discussing using PFOR for everything. At > Elastic we've been looking into building 3-gram indexes to perform infix > search, and noticed that we would save a lot of space if we moved to PFOR > rather than FOR for doc ID deltas. The reason is that these postings of > 3-grams are very dense, and saving one bit per value saves a lot more space > relatively when blocks have 5 bits per value than when they have e.g. 15 > bits per value. > > > > > > On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]> > wrote: > > >> > > >> Hi folks- > > >> > > >> I'm curious to understand the history/context of using PFOR for > positions and frequencies while continuing to use basic FOR for docid > encoding. I've done my best to turn up any past conversations on this, but > wasn't able to find much. Apologies if I missed it in my digging! From what > I've gathered, the basic FOR encoding was introduced to Lucene with > LUCENE-3892 (which was a continuation of LUCENE-1410). While PFOR had been > discussed plenty in the earlier issues, I gather that it wasn't actually > committed until LUCENE-9027. Hopefully I've got that much right. And it > appears at that time to have been introduced for positions and frequencies, > but not docids. > > >> > > >> Is the reasoning here that, a) since docids are delta-encoded > already, outliers/exceptions will be less likely/beneficial, and b) FOR > allows for an optimization in decoding the deltas (via. > ForUtil#decodeAndPrefixSum) which can't be utilized with PFOR, since the > exceptions must be patched in before decoding deltas? Are the other reasons > FOR continues to be used for docids that I'm overlooking? > > >> > > >> I'm curious as I recently ran some internal benchmarks on the Amazon > product search engine replacing FOR with PFOR for docids delta encoding, > and saw an index size reduction of -0.93% while also improving our red-line > queries/sec by +1.0%. I expected the index size reduction but wasn't > expecting to see a QPS improvement, which I haven't yet been able to > explain. I'm wondering if there are some good reasons to keep using FOR for > docids, or if there'd be any appetite to discuss using PFOR for everything? > Again, apologies if I've overlooked some past discussion in my digging. Any > history/context is much appreciated! > > >> > > >> Cheers, > > >> -Greg > > > > > > > > > > > > -- > > > Adrien > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Adrien
