Thanks Mike! I've started reading that paper, and it's interesting
indeed! I don't have much time to explore this further at the moment,
but I plan to revisit when I do. It'd be fun to explore the idea in
the paper and see whether a "best of both worlds" approach is possible
(the speedups of binary searching while maintaining the
forward-iteration paradigm)! It would also be interesting to compare
that implementation to the simple, flat binary-searched list.
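
One hypothetical shape for that hybrid (just a sketch with made-up
names, and nothing I've benchmarked): gallop forward from the current
position over the flat array of per-block skip entries, doubling the
step each time, then binary search only the bracketed range. Accesses
stay forward-biased, but the cost is still logarithmic in the skip
distance:

  // Returns the index of the first block whose lastDoc >= target,
  // searching forward from `current` (advance() never moves backward).
  static int gallopTo(int[] lastDocs, int current, int target) {
    int step = 1;
    // Gallop: grow the step exponentially until the target is bracketed.
    while (current + step < lastDocs.length
        && lastDocs[current + step] < target) {
      step <<= 1;
    }
    // Binary search within the bracketed range.
    int lo = current + (step >> 1);
    int hi = Math.min(current + step, lastDocs.length - 1);
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (lastDocs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    // Caller must still check lastDocs[lo] >= target (else exhausted).
    return lo;
  }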

Cheers,
-Greg

On Mon, May 3, 2021 at 10:31 AM Michael McCandless
<[email protected]> wrote:
>
> +1 to explore innovations in Lucene's skip-list implementation!
>
> Your impressive initial gains show that skipping is indeed a big cost for
> certain queries.
>
> Lucene's multi-level skip list implementation is quite old now (I think more
> than a decade?) and, I think, rather complex. There is a fun paper describing
> a simple, probabilistic approach to inlining the (still multi-level) skip
> data, linked from this long-ago issue:
> https://issues.apache.org/jira/browse/LUCENE-2962. That would be a way to
> keep the sequential (slow-disk-friendly) access and perhaps reduce the cost
> versus Lucene's skipping implementation today.
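>
> From what I remember, the core trick is along the lines of the classic
> probabilistic skip-list promotion scheme, something like this sketch
> (not necessarily the paper's exact scheme; names are made up):
>
>   import java.util.Random;
>
>   class SkipLevels {
>     private static final int MAX_LEVEL = 10;     // cap on nesting depth
>     private static final int SKIP_INTERVAL = 8;  // promote with probability 1/8
>
>     // Each entry is promoted to the next level with probability
>     // 1/SKIP_INTERVAL, so higher levels hold geometrically fewer entries
>     // and expected lookup cost stays logarithmic, while the skip data
>     // can still be written inline, in document order.
>     static int skipLevel(Random random) {
>       int level = 0;
>       while (level < MAX_LEVEL && random.nextInt(SKIP_INTERVAL) == 0) {
>         level++;
>       }
>       return level;
>     }
>   }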
>
> But I do love the simplicity of a flat array of the bottom-most skip
> data. Having that available as an optional (non-default) Codec would be nice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 27, 2021 at 7:42 PM Greg Miller <[email protected]> wrote:
>>
>> Thanks for the reply Adrien! It makes sense to ensure the default
>> codec continues to scale for large indexes that can't fit in a
>> machine's physical memory. Thanks for the thoughts and context on the
>> terms/points indexes, etc.
>>
>> I'll look into how this idea could be spun off into a separate
>> lucene/codecs implementation and open a Jira issue to track that work
>> a little later this week. I'll also spend a little time thinking about
>> whether there might be some sort of hybrid solution that enables some
>> of these gains while maintaining the sequential access. I don't have
>> anything off the top of my head, but maybe if I put a draft of my
>> change up as a PR (under lucene/codecs) it will spark some other
>> ideas!
>>
>> Thanks again for your thoughts!
>>
>> Cheers,
>> -Greg
>>
>> On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
>> >
>> > Hi Greg,
>> >
>> > I like that Lucene can scale to index sizes that are much larger than the
>> > amount of main memory, so I would like the default codec to keep
>> > optimizing for sequential reads. We do random access for some parts of the
>> > index, like the terms index and the points index, but the expectation is
>> > that they are so small that they would always fit in the page cache (they
>> > were actually in the JVM heap not long ago). A skip index for every
>> > postings list feels like something that could be much larger than the
>> > terms index, so I don't think it would be a good fit for the default
>> > codec.
>> >
>> > We could still implement your idea in lucene/codecs for users who can 
>> > force load their index into memory, the speedup looks significant for 
>> > queries that do lots of skipping!
>> >
>> > These results make me wonder: are there other ways we could get similar
>> > benefits while keeping a sequential access pattern when reading postings?
>> >
>> > On Tue, Apr 27, 2021 at 9:03 PM, Greg Miller <[email protected]> wrote:
>> >>
>> >> Hey folks-
>> >>
>> >> I've been experimenting with a different approach to implementing skip
>> >> lists in Lucene, and would be curious if anyone else has tried
>> >> something similar, or has early thoughts/feedback on what I'm doing.
>> >>
>> >> The idea is generally a simplification of what Lucene currently does,
>> >> targeted at improving QPS at search time. Instead of treating skip
>> >> lists as forward-iterating structures, I'm indexing a single, flat
>> >> structure that I binary search when advancing. I'm indexing the same
>> >> data present in the lowest level of the current structures (i.e., skip
>> >> pointers to each block of 128 docs), and then binary searching those
>> >> pointers to find the candidate block for the target docID (instead of
>> >> only seeking forward).
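>> >>
>> >> To make that concrete, here's a rough sketch of the search-time side
>> >> (hypothetical names; it assumes the writer records one (lastDocInBlock,
>> >> filePointer) pair per 128-doc block, held in parallel arrays):
>> >>
>> >>   /** Flat, binary-searched skip data: one entry per 128-doc block. */
>> >>   static final class FlatSkipList {
>> >>     private final int[] lastDocs;   // last docID of each block, ascending
>> >>     private final long[] pointers;  // file pointer to each block's start
>> >>
>> >>     FlatSkipList(int[] lastDocs, long[] pointers) {
>> >>       this.lastDocs = lastDocs;
>> >>       this.pointers = pointers;
>> >>     }
>> >>
>> >>     /** File pointer of the first block that could contain target. */
>> >>     long advance(int target) {
>> >>       int lo = 0, hi = lastDocs.length - 1;
>> >>       while (lo < hi) {
>> >>         int mid = (lo + hi) >>> 1;
>> >>         if (lastDocs[mid] < target) {
>> >>           lo = mid + 1;  // every doc in block mid is before target
>> >>         } else {
>> >>           hi = mid;      // block mid may contain target
>> >>         }
>> >>       }
>> >>       return pointers[lo];
>> >>     }
>> >>   }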
>> >>
>> >> Before describing this idea in more detail (or opening a Jira), I'd be
>> >> curious how this community thinks about disk access operations and
>> >> what platforms/use-cases we primarily optimize for these days. This
>> >> approach I've been experimenting with is essentially a non-starter if
>> >> we assume that index accesses generally involve going all the way to
>> >> disk, especially if those disks spin. Random seeks all over the place
>> >> are probably a terrible idea if they're actually hitting disk. In the
>> >> cases I've been experimenting with, indexes are generally hot, so
>> >> random seeks aren't much of a concern. Do we tend to optimize with
>> >> this assumption in mind, or are we still pretty careful about use
>> >> cases that are doing a lot of disk IO?
>> >>
>> >> There are some other tricky implications of the approach I'm
>> >> experimenting with (some edge cases around Impacts, and index size
>> >> growth due to not being able to do as much delta-encoding), but it's
>> >> not worth going into those until addressing the main question of
>> >> whether random seeks even make sense.
>> >>
>> >> I've done some early benchmarking with luceneutil (wikimedium10m
>> >> specifically) and the idea looks like it might have some promise. I
>> >> don't really see any regressions to QPS*, while a few tasks seem to
>> >> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
>> >> 11.1%, OrHighLow: 12.3%).
>> >> * (note: I've disabled Impacts in both baseline and candidate for
>> >> these early experiments because of some challenges related to them...
>> >> so these results come with a major asterisk right now and more work
>> >> would certainly need to be done)
>> >>
>> >> Thanks in advance for any feedback! I'm completely open to shooting
>> >> down this idea early if there are some fundamental flaws, or
>> >> alternatively opening up a Jira to investigate this further if folks
>> >> think it's reasonable to explore more :)
>> >>
>> >> Cheers,
>> >> -Greg
>> >>
>>
