+1 to explore innovations in Lucene's skip-list implementation!

Your impressive initial gains show that skipping is indeed a big cost for
certain queries.

Lucene's multi-level skip-list implementation is quite old now (more than
a decade, I think?) and rather complex.  There is a fun paper describing a
simple, probabilistic approach to inlining the (still multi-level) skip
data, linked from this long-ago issue:
https://issues.apache.org/jira/browse/LUCENE-2962  That would be a way to
keep the sequential (slow-disk-friendly) access pattern and perhaps reduce
the cost versus Lucene's skipping implementation today.
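
To give a feel for the probabilistic part, here is a tiny sketch in the
spirit of Pugh's classic coin-flip skip lists; I'm not claiming this is
the paper's exact scheme, and the constants are made up.  Each block of
postings draws its skip level independently, so higher levels come out
exponentially sparser with no global bookkeeping, which is what makes
inlining the skip data as you write postings straightforward:

  static final int MAX_SKIP_LEVELS = 10; // hypothetical cap
  static final int SKIP_PROB_INV = 8;    // p = 1/8, hypothetical

  // A block reaches level L with probability (1/8)^L: roughly 1 in 8
  // blocks carries a level-1 entry, 1 in 64 a level-2 entry, and so on.
  static int skipLevelForBlock(java.util.Random random) {
    int level = 0;
    while (level < MAX_SKIP_LEVELS && random.nextInt(SKIP_PROB_INV) == 0) {
      level++;
    }
    return level;
  }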

But I do love the simplicity of a flat array of the bottom-most skip
data.  Having that as an optional (non-default) Codec would be a nice
option.
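
Something like this is all the search-time logic that variant would need
(a toy sketch with invented names, assuming the bottom-level entries for
a term are loaded as parallel in-memory arrays):

  // Toy sketch, not Lucene's actual format: skipDocs[i] is the last docID
  // of block i (ascending), filePointers[i] is where block i starts in
  // the postings file.
  final class FlatSkipReader {
    private final int[] skipDocs;
    private final long[] filePointers;

    FlatSkipReader(int[] skipDocs, long[] filePointers) {
      this.skipDocs = skipDocs;
      this.filePointers = filePointers;
    }

    // Returns the index of the first block whose last docID is >= target,
    // or skipDocs.length if target is beyond the last block.
    int blockFor(int target) {
      int lo = 0, hi = skipDocs.length;
      while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (skipDocs[mid] < target) {
          lo = mid + 1;
        } else {
          hi = mid;
        }
      }
      return lo; // caller seeks to filePointers[lo] and scans the block
    }
  }

And writing is just as simple: append one (lastDocID, filePointer) pair
per block at flush time, no level towers.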

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <[email protected]> wrote:

> Thanks for the reply, Adrien! It makes sense to ensure the default
> codec continues to scale for large indexes that can't fit in a
> machine's physical memory. Thanks for the thoughts and context on the
> terms/points indexes, etc.
>
> I'll look into how this idea could be spun off into a separate
> lucene/codecs implementation and open a Jira issue to track that work
> a little later this week. I'll also spend a little time thinking about
> whether or not there might be some sort of hybrid solution that
> enables some of these gains while maintaining sequential access. I
> don't have anything off the top of my head, but maybe if I put a draft
> of my change up as a PR (under lucene/codecs) it will spark some other
> ideas!
>
> Thanks again for your thoughts!
>
> Cheers,
> -Greg
>
> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
> >
> > Hi Greg,
> >
> > I like that Lucene can scale to index sizes that are much larger
> > than the amount of main memory, so I would like the default codec to
> > keep optimizing for sequential reads. We do random access for some
> > parts of the index, like the terms index and the points index, but
> > the expectation is that they are so small that they would always fit
> > in the page cache (they were actually in the JVM heap not long ago).
> > A skip index for every postings list feels like something that could
> > be much larger than the terms index, so I don't think it would be a
> > good fit for the default codec.
> >
> > We could still implement your idea in lucene/codecs for users who can
> > force-load their index into memory; the speedup looks significant for
> > queries that do lots of skipping!
> >
> > These results make me wonder if there are other ways we could get
> > similar benefits while keeping a sequential access pattern when
> > reading postings?
> >
> > On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <[email protected]> wrote:
> >>
> >> Hey folks-
> >>
> >> I've been experimenting with a different approach to implementing skip
> >> lists in Lucene, and would be curious if anyone else has tried
> >> something similar, or has early thoughts/feedback on what I'm doing.
> >>
> >> The idea is generally a simplification of what Lucene currently does,
> >> targeted at improving QPS at search time. Instead of treating skip
> >> lists as forward-iterating structures, I'm indexing a single, flat
> >> structure that I binary search when advancing. I'm indexing the same
> >> data present in the lowest level of the current structures (i.e., skip
> >> pointers to each block of 128 docs), and then binary searching those
> >> pointers to find the candidate block for the target docID (instead of
> >> only seeking forward).
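> >>
> >> As a rough illustration (a sketch, not my actual patch; the field
> >> names here are invented), advancing binary searches the flat array,
> >> restricted to blocks at or after the current one since targets only
> >> move forward:
> >>
> >> // blockMaxDoc[i] = last docID in block i (128 docs per block).
> >> // Restricting the search to [currentBlock, numBlocks) keeps short
> >> // forward hops cheap while long skips stay O(log n).
> >> private int advanceToBlock(int target) {
> >>   int lo = currentBlock, hi = blockMaxDoc.length - 1;
> >>   while (lo < hi) {
> >>     int mid = (lo + hi) >>> 1;
> >>     if (blockMaxDoc[mid] < target) {
> >>       lo = mid + 1;
> >>     } else {
> >>       hi = mid;
> >>     }
> >>   }
> >>   currentBlock = lo;
> >>   return lo; // then seek to that block's file pointer and scan
> >> }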
> >>
> >> Before describing this idea in more detail (or opening a Jira), I'd be
> >> curious how this community thinks about disk access operations and
> >> what platforms/use-cases we primarily optimize for these days. This
> >> approach I've been experimenting with is essentially a non-starter if
> >> we assume that index accesses generally involve actually going to
> >> disk, especially if those disks spin. Random seeks all over the place
> >> are probably a terrible idea if they are actually hitting disk. In
> >> the cases I've been experimenting with, indexes are generally hot, so
> >> random seeks aren't much of a concern. Do we tend to optimize with
> >> this assumption in mind, or are we still pretty careful about
> >> use-cases that are actually doing a lot of disk IO?
> >>
> >> There are some other tricky implications with the approach I'm
> >> experimenting with (some edge cases around Impacts and index size
> >> growth due to not being able to do as much delta-encoding), but it's
> >> not worth going into those until addressing the main idea of
> >> whether or not random seeks even make sense.
> >>
> >> I've done some early benchmarking with luceneutil (wikimedium10m
> >> specifically) and the idea looks like it might have some promise. I
> >> don't really see any regressions to QPS*, while a few tasks seem to
> >> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
> >> 11.1%, OrHighLow: 12.3%).
> >> * (note: I've disabled Impacts in both baseline/candidate for early
> >> experiments because of some challenges related to them... so these
> >> results have a major asterisk right now, and more work would certainly
> >> need to be done)
> >>
> >> Thanks in advance for any feedback! I'm completely open to shooting
> >> down this idea early if there are some fundamental flaws, or
> >> alternatively opening up a Jira to investigate this further if folks
> >> think it's reasonable to explore more :)
> >>
> >> Cheers,
> >> -Greg
