Hey folks- I've been experimenting with a different approach to implementing skip lists in Lucene, and would be curious if anyone else has tried something similar, or has early thoughts/feedback on what I'm doing.
The idea is generally a simplification of what Lucene currently does, targeted at improving QPS at search-time. Instead of treating skip lists as forward-iterating structures, I'm indexing a single, flat structure that I binary search when advancing. I'm indexing the same data present in the lowest level of the current structures (i.e., skip pointers to each block of 128 docs), and then binary searching those pointers to find the candidate block for the target docID (instead of only seeking forward). Before describing this idea in more detail (or opening a Jira), I'd be curious how this community thinks about disk access operations and what platforms/use-cases we primarily optimize for these days. This approach I've been experimenting with is essentially a non-starter if we assume that index accesses generally involve actually going to disk—especially if those disks spin. Random seeks all over the place are probably a terrible idea if those are actually hitting disk. In the cases I've been experimenting with, indexes are generally hot, so random seeks aren't much of a concern. Do we tend to optimize with this assumption in mind, or are we still pretty careful about use-cases that are actually doing a lot of disk IO? There are some other tricky implications with the approach I'm experimenting with (some edge cases around Impacts and index size growth due to not being able to do as much delta-encoding), but it's not worth going into those until addressing the main idea of whether-or-not random seeks even make sense. I've done some early benchmarking with luceneutil (wikimedium10m specifically) and the idea looks like it might have some promise. I don't really see any regressions to QPS*, while a few tasks seem to show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow: 11.1%, OrHighLow: 12.3%). * (note: I've disabled Impacts in both baseline/candidate for early experiments because of some challenges related to them... so these results have a major asteriks right now and more work would certainly need to be done) Thanks in advance for any feedback! I'm completely open to shooting down this idea early if there are some fundamental flaws, or alternatively opening up a Jira to investigate this further if folks think it's reasonable to explore more :) Cheers, -Greg --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
