Re: PFOR for docids?

Adrien Grand Thu, 18 Mar 2021 04:13:18 -0700

This isn't only appropriate, but also the preferred way to move such
discussions forward. Thanks Greg!


On Wed, Mar 17, 2021 at 11:27 PM Greg Miller <[email protected]> wrote:

> I went ahead and created a Jira issue to track this investigation:
> https://issues.apache.org/jira/browse/LUCENE-9850. I thought a Jira
> issue might help us better track my findings as I work on this. I hope
> this was appropriate, but if not, please let me know. Thanks again!
>
> Cheers,
> -Greg
>
> On Tue, Mar 16, 2021 at 12:51 PM Greg Miller <[email protected]> wrote:
> >
> > Hi Adrien-
> >
> > Thanks for all the additional context and the benchmark link!
> >
> > > The reason is that these postings of 3-grams are very dense, and
> saving one bit per value saves a lot more space relatively when blocks have
> 5 bits per value than when they have e.g. 15 bits per value.
> >
> > That's an interesting observation. I'm working on gathering this same
> > bpv stat in our application right now, so this is useful to keep in
> > mind.
> >
> > It seems like it would be valuable to measure the performance impact
> > of moving to PFOR for doc IDs in contrast to what it might save in
> > index size. While I'm able to do this on our internal system with our
> > own benchmarks, that's a very specific use-case and may not translate
> > well to a more general upstream solution. Do you have any guidance on
> > how this type of evaluation is generally done in the context of an
> > upstream change? As a first step, I can open a Jira issue to track the
> > evaluation if you think that would be useful. Thanks again!
> >
> > Cheers,
> > -Greg
> >
> >
> > On Tue, Mar 16, 2021 at 2:11 AM Adrien Grand <[email protected]> wrote:
> > >
> > > Hi Greg,
> > >
> > > You guessed b) right. I had done some microbenchmarks that suggested
> that the optimization of the prefix sum was important performance-wise
> compared to unpacking and then doing the prefix sum.
> > >
> > > The other factor that played a role to me is that frequencies and
> positions are usually less performance-sensitive than doc IDs. For instance
> conjunctions only decode frequencies after reaching agreement across all
> clauses, and phrase queries only decode positions if term frequencies
> suggest that the current document could have a competitive score.
> > >
> > > I would be interested in discussing using PFOR for everything. At
> Elastic we've been looking into building 3-gram indexes to perform infix
> search, and noticed that we would save a lot of space if we moved to PFOR
> rather than FOR for doc ID deltas. The reason is that these postings of
> 3-grams are very dense, and saving one bit per value saves a lot more space
> relatively when blocks have 5 bits per value than when they have e.g. 15
> bits per value.
> > >
> > > On Mon, Mar 15, 2021 at 8:53 PM Greg Miller <[email protected]>
> wrote:
> > >>
> > >> Hi folks-
> > >>
> > >> I'm curious to understand the history/context of using PFOR for
> positions and frequencies while continuing to use basic FOR for docid
> encoding. I've done my best to turn up any past conversations on this, but
> wasn't able to find much. Apologies if I missed it in my digging! From what
> I've gathered, the basic FOR encoding was introduced to Lucene with
> LUCENE-3892 (which was a continuation of LUCENE-1410). While PFOR had been
> discussed plenty in the earlier issues, I gather that it wasn't actually
> committed until LUCENE-9027. Hopefully I've got that much right. And it
> appears at that time to have been introduced for positions and frequencies,
> but not docids.
> > >>
> > >> Is the reasoning here that, a) since docids are delta-encoded
> already, outliers/exceptions will be less likely/beneficial, and b) FOR
> allows for an optimization in decoding the deltas (via.
> ForUtil#decodeAndPrefixSum) which can't be utilized with PFOR, since the
> exceptions must be patched in before decoding deltas? Are the other reasons
> FOR continues to be used for docids that I'm overlooking?
> > >>
> > >> I'm curious as I recently ran some internal benchmarks on the Amazon
> product search engine replacing FOR with PFOR for docids delta encoding,
> and saw an index size reduction of -0.93% while also improving our red-line
> queries/sec by +1.0%. I expected the index size reduction but wasn't
> expecting to see a QPS improvement, which I haven't yet been able to
> explain. I'm wondering if there are some good reasons to keep using FOR for
> docids, or if there'd be any appetite to discuss using PFOR for everything?
> Again, apologies if I've overlooked some past discussion in my digging. Any
> history/context is much appreciated!
> > >>
> > >> Cheers,
> > >> -Greg
> > >
> > >
> > >
> > > --
> > > Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

-- 
Adrien

Re: PFOR for docids?

Reply via email to