+1 to explore innovations in Lucene's skip-list implementation! Your impressive initial gains show that skipping is indeed a big cost for certain queries.
Lucene's multi-level skip list implementation is quite old now (I think
more than a decade?) and rather complex. There is a fun paper describing
a simple, probabilistic approach to inlining the (still multi-level)
skip data, linked from this long-ago issue:
https://issues.apache.org/jira/browse/LUCENE-2962 That would be a way to
keep the sequential (slow-disk-friendly) access while perhaps reducing
the cost versus Lucene's skipping implementation today. But I do love
the simplicity of a flat array of the bottom-most skip data. Having that
as an optional (non-default) Codec would be nice.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <[email protected]> wrote:
> Thanks for the reply Adrien! It makes sense to ensure the default
> codec continues to scale for large indexes that can't fit in a
> machine's physical memory. Thanks for the thoughts and context on the
> terms/points indexes, etc.
>
> I'll look into how this idea could be spun off into a separate
> lucene/codecs implementation and open a Jira issue to track that work
> a little later this week. I'll also spend a little time thinking about
> whether or not there might be some sort of hybrid solution that
> enables some of these gains while maintaining the sequential access. I
> don't have anything off the top of my head, but maybe if I put a draft
> of my change up as a PR (under lucene/codecs) it will spark some other
> ideas!
>
> Thanks again for your thoughts!
>
> Cheers,
> -Greg
>
> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
> >
> > Hi Greg,
> >
> > I like that Lucene can scale to index sizes that are much larger
> > than the amount of main memory, so I would like the default codec to
> > keep optimizing for sequential reads. We do random access for some
> > parts of the index, like the terms index and the points index, but
> > the expectation is that they are so small that they would always fit
> > in the page cache (they were actually in the JVM heap not long ago).
> > A skip index for every postings list feels like something that could
> > be much larger than the terms index, so I don't think it would be a
> > good fit for the default codec.
> >
> > We could still implement your idea in lucene/codecs for users who
> > can force load their index into memory; the speedup looks
> > significant for queries that do lots of skipping!
> >
> > These results make me wonder if there are other ways we could get
> > similar benefits while keeping a sequential access pattern when
> > reading postings.
> >
> > On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <[email protected]> wrote:
> >>
> >> Hey folks-
> >>
> >> I've been experimenting with a different approach to implementing
> >> skip lists in Lucene, and would be curious if anyone else has tried
> >> something similar, or has early thoughts/feedback on what I'm doing.
> >>
> >> The idea is generally a simplification of what Lucene currently
> >> does, targeted at improving QPS at search time. Instead of treating
> >> skip lists as forward-iterating structures, I'm indexing a single,
> >> flat structure that I binary search when advancing. I'm indexing the
> >> same data present in the lowest level of the current structures
> >> (i.e., skip pointers to each block of 128 docs), and then binary
> >> searching those pointers to find the candidate block for the target
> >> docID (instead of only seeking forward).
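
To make the flat-array idea concrete, here is a minimal sketch of what
binary searching the bottom-level skip pointers might look like,
assuming the skip data is loaded into memory as parallel arrays (one
entry per 128-doc block). The class and member names are hypothetical,
not Greg's actual patch or Lucene's API:

    // Hypothetical sketch, not actual Lucene code: the bottom skip
    // level held as flat, parallel arrays, one entry per 128-doc block.
    final class FlatSkipData {
      final int[] lastDocIDs;    // last docID in each block, ascending
      final long[] filePointers; // postings-file pointer to each block's start

      FlatSkipData(int[] lastDocIDs, long[] filePointers) {
        this.lastDocIDs = lastDocIDs;
        this.filePointers = filePointers;
      }

      // Binary search for the first block whose last docID is >= target,
      // i.e. the candidate block that could contain the target docID.
      // A caller would handle the case where target exceeds all docIDs.
      long seek(int targetDocID) {
        int lo = 0, hi = lastDocIDs.length - 1;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (lastDocIDs[mid] < targetDocID) {
            lo = mid + 1;
          } else {
            hi = mid;
          }
        }
        return filePointers[lo];
      }
    }

Unlike the current forward-only skipper, every advance becomes an
O(log n) search over the whole list; the tradeoff discussed in this
thread is that those probes are random accesses, which are only cheap
if the skip data is hot in memory or the page cache.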
> >>
> >> Before describing this idea in more detail (or opening a Jira), I'd
> >> be curious how this community thinks about disk access operations
> >> and what platforms/use-cases we primarily optimize for these days.
> >> This approach I've been experimenting with is essentially a
> >> non-starter if we assume that index accesses generally involve
> >> actually going to disk -- especially if those disks spin. Random
> >> seeks all over the place are probably a terrible idea if they are
> >> actually hitting disk. In the cases I've been experimenting with,
> >> indexes are generally hot, so random seeks aren't much of a concern.
> >> Do we tend to optimize with this assumption in mind, or are we still
> >> pretty careful about use-cases that are actually doing a lot of
> >> disk IO?
> >>
> >> There are some other tricky implications with the approach I'm
> >> experimenting with (some edge cases around Impacts, and index size
> >> growth due to not being able to do as much delta-encoding), but it's
> >> not worth going into those until addressing the main question of
> >> whether random seeks even make sense.
> >>
> >> I've done some early benchmarking with luceneutil (wikimedium10m
> >> specifically) and the idea looks like it might have some promise. I
> >> don't really see any regressions to QPS*, while a few tasks seem to
> >> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
> >> 11.1%, OrHighLow: 12.3%).
> >> * (note: I've disabled Impacts in both baseline/candidate for early
> >> experiments because of some challenges related to them... so these
> >> results have a major asterisk right now and more work would
> >> certainly need to be done)
> >>
> >> Thanks in advance for any feedback! I'm completely open to shooting
> >> down this idea early if there are some fundamental flaws, or
> >> alternatively opening up a Jira to investigate this further if folks
> >> think it's reasonable to explore more :)
> >>
> >> Cheers,
> >> -Greg
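
As a side note on the delta-encoding tradeoff Greg mentions above: a
forward-read skip stream can encode each entry as a small
variable-length delta from the previous entry, while a binary-searchable
flat array generally needs fixed-width absolute values so that entry i
can be decoded without first reading entries 0..i-1. A hypothetical
illustration of the two layouts (not actual Lucene code, though
DataOutput's writeVInt/writeVLong/writeInt/writeLong methods are real):

    import java.io.IOException;
    import org.apache.lucene.store.DataOutput;

    final class SkipWriteSketch {
      // Forward-only layout: successive deltas are small, so vints
      // stay compact (often just a byte or two per value).
      static void writeSequential(DataOutput out, int[] lastDocIDs,
                                  long[] pointers) throws IOException {
        int prevDoc = 0;
        long prevFP = 0;
        for (int i = 0; i < lastDocIDs.length; i++) {
          out.writeVInt(lastDocIDs[i] - prevDoc);
          out.writeVLong(pointers[i] - prevFP); // pointers ascend, so deltas >= 0
          prevDoc = lastDocIDs[i];
          prevFP = pointers[i];
        }
      }

      // Random-access layout: fixed-width absolute values, so entry i
      // lives at offset i * 12 and can be binary searched directly,
      // at the cost of larger skip data overall.
      static void writeRandomAccess(DataOutput out, int[] lastDocIDs,
                                    long[] pointers) throws IOException {
        for (int i = 0; i < lastDocIDs.length; i++) {
          out.writeInt(lastDocIDs[i]); // always 4 bytes
          out.writeLong(pointers[i]);  // always 8 bytes
        }
      }
    }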
