Thanks Mike! I've started reading that paper, and it's interesting indeed! I don't have much time to explore this further at the moment, but I plan to revisit it when I do. It'd be fun to explore the idea in the paper and see whether a "best of both worlds" approach is possible (the speedups of binary searching while maintaining the forward-iteration paradigm)! It would also be interesting to compare that implementation to the simple, flat binary-searched list.
Cheers,
-Greg

On Mon, May 3, 2021 at 10:31 AM Michael McCandless <[email protected]> wrote:
>
> +1 to explore innovations in Lucene's skip-list implementation!
>
> Your initial impressive gains show that skipping is indeed a big cost for
> certain queries.
>
> Lucene's multi-level skip-list implementation is quite old now (more than
> a decade, I think?) and rather complex. There is a fun paper describing a
> simple, probabilistic approach to inlining the (still multi-level) skip
> data, linked from this long-ago issue:
> https://issues.apache.org/jira/browse/LUCENE-2962
> That would be a way to keep the sequential (slow-disk-friendly) access and
> perhaps reduce the cost versus Lucene's skipping implementation today.
>
> But I do love the simplicity of a flat array of the bottom-most skip data.
> Having that as an optional (non-default) Codec would be a nice option.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <[email protected]> wrote:
>>
>> Thanks for the reply, Adrien! It makes sense to ensure the default
>> codec continues to scale for large indexes that can't fit in a
>> machine's physical memory. Thanks for the thoughts and context on the
>> terms/points indexes, etc.
>>
>> I'll look into how this idea could be spun off into a separate
>> lucene/codecs implementation and open a Jira issue to track that work
>> a little later this week. I'll also spend a little time thinking about
>> whether there might be some sort of hybrid solution that enables some
>> of these gains while maintaining sequential access. I don't have
>> anything off the top of my head, but maybe if I put a draft of my
>> change up as a PR (under lucene/codecs) it will spark some other ideas!
>>
>> Thanks again for your thoughts!
>>
>> Cheers,
>> -Greg
>>
>> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
>> >
>> > Hi Greg,
>> >
>> > I like that Lucene can scale to index sizes that are much larger than
>> > the amount of main memory, so I would like the default codec to keep
>> > optimizing for sequential reads. We do random access for some parts of
>> > the index, like the terms index and the points index, but the
>> > expectation is that they are so small that they would always fit in
>> > the page cache (they were actually in the JVM heap not long ago). A
>> > skip index for every postings list feels like something that could be
>> > much larger than the terms index, so I don't think it would be a good
>> > fit for the default codec.
>> >
>> > We could still implement your idea in lucene/codecs for users who can
>> > force-load their index into memory; the speedup looks significant for
>> > queries that do lots of skipping!
>> >
>> > These results make me wonder whether there are other ways we could get
>> > similar benefits while keeping a sequential access pattern when
>> > reading postings.
>> >
>> > On Tue, Apr 27, 2021 at 21:03, Greg Miller <[email protected]> wrote:
>> >>
>> >> Hey folks-
>> >>
>> >> I've been experimenting with a different approach to implementing
>> >> skip lists in Lucene, and I'd be curious whether anyone else has
>> >> tried something similar, or has early thoughts/feedback on what I'm
>> >> doing.
>> >>
>> >> The idea is generally a simplification of what Lucene currently does,
>> >> targeted at improving QPS at search time. Instead of treating skip
>> >> lists as forward-iterating structures, I'm indexing a single, flat
>> >> structure that I binary search when advancing. I'm indexing the same
>> >> data present in the lowest level of the current structures (i.e.,
>> >> skip pointers to each block of 128 docs), and then binary searching
>> >> those pointers to find the candidate block for the target docID
>> >> (instead of only seeking forward).
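>> >>
>> >> Conceptually, the advance path becomes a bounded binary search over
>> >> that flat array. A rough sketch of what I mean (hypothetical names,
>> >> not the actual patch):
>> >>
>> >>   // blockMaxDocs[i] holds the last docID in the i-th 128-doc block
>> >>   // of this postings list, loaded as one flat in-memory array.
>> >>   // Returns the first block at or after currentBlock whose max docID
>> >>   // is >= target, or -1 if the iterator is exhausted. Searching only
>> >>   // from currentBlock forward preserves the advance() contract.
>> >>   int findCandidateBlock(int[] blockMaxDocs, int currentBlock, int target) {
>> >>     int lo = currentBlock, hi = blockMaxDocs.length - 1;
>> >>     if (blockMaxDocs[hi] < target) {
>> >>       return -1; // target is past the last block
>> >>     }
>> >>     while (lo < hi) {
>> >>       int mid = (lo + hi) >>> 1;
>> >>       if (blockMaxDocs[mid] < target) {
>> >>         lo = mid + 1; // candidate must be in a later block
>> >>       } else {
>> >>         hi = mid; // mid might still be the candidate block
>> >>       }
>> >>     }
>> >>     return lo; // caller seeks the postings stream to this block's start
>> >>   }
>> >>
>> >> (In practice each entry would presumably also carry a file pointer to
>> >> the start of its block, but the search itself is the whole trick.)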
>> >>
>> >> Before describing this idea in more detail (or opening a Jira), I'd
>> >> be curious how this community thinks about disk access operations and
>> >> what platforms/use-cases we primarily optimize for these days. The
>> >> approach I've been experimenting with is essentially a non-starter if
>> >> we assume that index accesses generally involve actually going to
>> >> disk, especially if those disks spin. Random seeks all over the place
>> >> are probably a terrible idea if they actually hit disk. In the cases
>> >> I've been experimenting with, indexes are generally hot, so random
>> >> seeks aren't much of a concern. Do we tend to optimize with this
>> >> assumption in mind, or are we still pretty careful about use-cases
>> >> that actually do a lot of disk IO?
>> >>
>> >> There are some other tricky implications of the approach I'm
>> >> experimenting with (some edge cases around Impacts, and index-size
>> >> growth from not being able to do as much delta-encoding), but it's
>> >> not worth going into those until addressing the main question of
>> >> whether random seeks even make sense.
>> >>
>> >> I've done some early benchmarking with luceneutil (wikimedium10m
>> >> specifically) and the idea looks like it might have some promise. I
>> >> don't really see any QPS regressions*, while a few tasks show
>> >> significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow: 11.1%,
>> >> OrHighLow: 12.3%).
>> >>
>> >> * (note: I've disabled Impacts in both the baseline and the candidate
>> >> for these early experiments because of some challenges related to
>> >> them... so these results carry a major asterisk right now, and more
>> >> work would certainly need to be done)
>> >>
>> >> Thanks in advance for any feedback! I'm completely open to shooting
>> >> this idea down early if there are some fundamental flaws, or
>> >> alternatively opening a Jira to investigate further if folks think
>> >> it's reasonable to explore more :)
>> >>
>> >> Cheers,
>> >> -Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
