Re: Intra-segment search concurrency implementation

Alan Woodward Wed, 31 Jul 2024 02:37:36 -0700

Hi Luca,

This is very exciting! I haven’t followed the dev process very closely so far, 
so this may already have been looked at and dismissed as unworkable for various 
reasons, but I’m wondering if we definitely need a new abstraction for a 
LeafReaderContext partition? Could we instead find a way to make 
IndexReader.leaves() return a view over the various segments that splits large 
segments into multiple LeafReaderContexts with different subsets of the docId 
space marked as deleted?
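
To make the idea concrete, here is a rough sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: LiveDocs is a hypothetical stand-in for a segment's live-docs bitset, and splitRanges and restrictToRange are made-up helper names, not existing Lucene methods. It only shows how several views over one segment could each report a different docId range as live and everything else as deleted.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeLiveDocsSketch {

    /** Hypothetical stand-in for a live-docs bitset: get(i) is true when doc i is live. */
    public interface LiveDocs {
        boolean get(int docId);

        int length();
    }

    /** Splits [0, maxDoc) into at most `partitions` contiguous ranges, each as {from, to}. */
    public static List<int[]> splitRanges(int maxDoc, int partitions) {
        List<int[]> ranges = new ArrayList<>();
        int base = maxDoc / partitions;
        int remainder = maxDoc % partitions;
        int start = 0;
        for (int i = 0; i < partitions; i++) {
            // The first `remainder` ranges take one extra doc each.
            int size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // more partitions requested than docs available
            }
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    /**
     * Returns a view in which only docs in [from, to) can be live; all other docs look deleted.
     * A null `original` models a segment without deletions.
     */
    public static LiveDocs restrictToRange(LiveDocs original, int segmentMaxDoc, int from, int to) {
        return new LiveDocs() {
            @Override
            public boolean get(int docId) {
                boolean inRange = docId >= from && docId < to;
                return inRange && (original == null || original.get(docId));
            }

            @Override
            public int length() {
                return segmentMaxDoc;
            }
        };
    }
}
```

Each range returned by splitRanges would back one view of the segment, so the existing per-leaf search loop could stay as it is and simply see more, smaller leaves.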


I suppose we could lose some optimisations in count() implementations, but 
maybe it would be possible to check up-front if the count() for a segment 
returns -1 and only do the split in that case.
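
As a sketch of that up-front check, again in plain Java and under stated assumptions: shouldSplit and minDocsToSplit are hypothetical names, and the only behaviour taken from the paragraph above is that count() returns -1 when no cheap count is available.

```java
public class SplitDecisionSketch {

    /**
     * Hypothetical rule: split a segment only when no cheap count is available
     * (cheapCount == -1) and the segment is large enough to be worth splitting.
     */
    public static boolean shouldSplit(int cheapCount, int segmentMaxDoc, int minDocsToSplit) {
        boolean countUnavailable = cheapCount == -1;
        return countUnavailable && segmentMaxDoc >= minDocsToSplit;
    }
}
```

With a rule like this, segments whose count() can be answered cheaply would stay whole and keep the optimisation.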

- Alan

> On 29 Jul 2024, at 22:45, Luca Cavanna <java...@apache.org> wrote:
> 
> Hey all,
> I have been working on an initial implementation of intra-segment search 
> concurrency for Lucene.
> 
> My goal is to introduce the ability to concurrently search partitions of the 
> same segment, think of a force-merged segment for instance, in a way that's 
> as transparent as possible to users. This way we can ideally decouple search 
> concurrency from the index geometry, with the least impact on users. As part 
> of my initial step, I decided to not tackle deduplicating work that happens 
> globally per segment, which every partition would repeat on its own. This is 
> certainly an important area to improve upon, yet I am hoping that we can 
> treat it as a follow-up, mostly because there is enough work to do even 
> without addressing that.
> 
> After quite a few iterations, I have just marked my PR ready for review: 
> https://github.com/apache/lucene/pull/13542. Tests are finally green. I wrote 
> a rather detailed description on the PR itself that includes the problems I 
> encountered, how I addressed them, and the way forward that I am proposing. 
> There are still a couple of rough edges, and the API terminology still needs 
> alignment. Mostly, what do we call a partition of a segment? Existing leaf 
> slices are partitions of an index. We are now introducing partitions of 
> segments that can be searched independently. I called them 
> LeafReaderContextPartition, but I am not particularly attached to this 
> specific name and open to feedback. This new terminology is only applied to 
> the IndexSearcher#search method (not called directly by users though) and the 
> IndexSearcher slice-related methods. Otherwise, users who just call search 
> hopefully don't need to know what a segment partition is.
> 
> I'd love to collect enough feedback to agree on a path forward and get this 
> merged for Lucene 10, as it requires some API breaking changes as well as 
> changes in internal behaviour.
> 
> 
> Looking forward to your feedback
> 
> Cheers
> Luca
> 
> 
>
