Thanks for the reply Adrien! It makes sense to ensure the default
codec continues to scale for large indexes that can't fit in a
machine's physical memory. Thanks for the thoughts and context on the
terms/points indexes, etc.

I'll look into how this idea could be spun off into a separate
lucene/codecs implementation and open a Jira issue to track that work
a little later this week. I'll also spend a little time thinking
about whether there might be some sort of hybrid solution that enables
some of these gains while maintaining sequential access. I don't have
anything off the top of my head, but maybe if I put a draft of my
change up as a PR (under lucene/codecs) it will spark some other
ideas!

Thanks again for your thoughts!

Cheers,
-Greg

On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
>
> Hi Greg,
>
> I like that Lucene can scale to index sizes that are much larger than the 
> amount of main memory, so I would like the default codec to keep optimizing 
> for sequential reads. We do random access for some parts of the index like 
> the terms index and the points index, but the expectation is that they are so 
> small that they would always fit in the page cache (they were actually in the JVM 
> heap not long ago). A skip index for every postings list feels like something 
> that could be much larger than the terms index so I don't think it would be a 
> good fit for the default codec.
>
> We could still implement your idea in lucene/codecs for users who can 
> force-load their index into memory; the speedup looks significant for 
> queries that do lots of skipping!
>
> These results make me wonder if there are other ways we could get similar 
> benefits while keeping a sequential access pattern when reading postings?
>
> On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <[email protected]> wrote:
>>
>> Hey folks-
>>
>> I've been experimenting with a different approach to implementing skip
>> lists in Lucene, and would be curious if anyone else has tried
>> something similar, or has early thoughts/feedback on what I'm doing.
>>
>> The idea is generally a simplification of what Lucene currently does,
>> targeted at improving QPS at search time. Instead of treating skip
>> lists as forward-iterating structures, I'm indexing a single, flat
>> structure that I binary search when advancing. I'm indexing the same
>> data present in the lowest level of the current structures (i.e., skip
>> pointers to each block of 128 docs), and then binary searching those
>> pointers to find the candidate block for the target docID (instead of
>> only seeking forward).
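>>
>> To give a rough sense of the shape (a simplified sketch; the names
>> below are invented for illustration and aren't from my actual
>> change), the lowest skip level becomes a pair of parallel arrays,
>> one entry per block of 128 docs, that advance() can binary search
>> directly:
>>
>>   // lastDocIDs[i] is the last docID in block i; filePointers[i] is
>>   // the offset of block i's postings. lastDocIDs is sorted, so we
>>   // can binary search for the first block that could contain the
>>   // advance target.
>>   static long advanceToBlock(int[] lastDocIDs, long[] filePointers,
>>                              int target) {
>>     int idx = java.util.Arrays.binarySearch(lastDocIDs, target);
>>     if (idx < 0) {
>>       // On a miss, binarySearch returns (-insertionPoint - 1); the
>>       // insertion point is the first block whose lastDocID is
>>       // greater than the target, which is exactly the block we want.
>>       idx = -idx - 1;
>>     }
>>     if (idx == lastDocIDs.length) {
>>       return -1; // target is past the last block: postings exhausted
>>     }
>>     return filePointers[idx]; // seek here, then scan within the block
>>   }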
>>
>> Before describing this idea in more detail (or opening a Jira), I'd be
>> curious how this community thinks about disk access operations and
>> what platforms/use-cases we primarily optimize for these days. This
>> approach I've been experimenting with is essentially a non-starter if
>> we assume that index accesses generally involve actually going to
>> disk, especially if those disks spin. Random seeks all over the place
>> are probably a terrible idea if they are actually hitting disk. In
>> the cases I've been experimenting with, indexes are generally hot, so
>> random seeks aren't much of a concern. Do we tend to optimize with
>> this assumption in mind, or are we still pretty careful about
>> use-cases that are actually doing a lot of disk IO?
>>
>> There are some other tricky implications with the approach I'm
>> experimenting with (some edge cases around Impacts and index size
>> growth due to not being able to do as much delta-encoding), but it's
>> not worth going into those until addressing the main idea of
>> whether or not random seeks even make sense.
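>>
>> That said, to make the delta-encoding point a little more concrete
>> (again a simplified sketch with invented names): forward-only
>> iteration lets each skip entry store a small variable-width delta
>> against the previous entry, accumulated as you scan. A binary search
>> has to decode entry i without reading entries 0..i-1, which pushes
>> toward fixed-width, absolute values and therefore a larger structure:
>>
>>   // Forward-only: variable-width deltas accumulate during the scan,
>>   // e.g. lastDocID += in.readVInt();
>>   // Binary search: each entry must be independently addressable,
>>   // e.g. a fixed 4 bytes per entry holding an absolute docID.
>>   static int readLastDocID(java.nio.ByteBuffer skipData, int i) {
>>     return skipData.getInt(i * Integer.BYTES);
>>   }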
>>
>> I've done some early benchmarking with luceneutil (wikimedium10m
>> specifically) and the idea looks like it might have some promise. I
>> don't really see any regressions to QPS*, while a few tasks seem to
>> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
>> 11.1%, OrHighLow: 12.3%).
>> * (note: I've disabled Impacts in both baseline/candidate for early
>> experiments because of some challenges related to them... so these
>> results have a major asterisk right now and more work would certainly
>> need to be done)
>>
>> Thanks in advance for any feedback! I'm completely open to shooting
>> down this idea early if there are some fundamental flaws, or
>> alternatively opening up a Jira to investigate this further if folks
>> think it's reasonable to explore more :)
>>
>> Cheers,
>> -Greg
>>
