Re: [I] Improve Lucene's I/O concurrency [lucene]

via GitHub Thu, 30 May 2024 00:39:34 -0700


sohami commented on issue #13179:
URL: https://github.com/apache/lucene/issues/13179#issuecomment-2138876791


   > To avoid this per-doc overhead, I imagine that we would need to add some 
prefetch() API on (Numeric|SortedNumeric|Sorted|SortedSet|Binary)DocValues like 
@sohami suggests and require it to be called at a higher level when this can be 
more easily amortized across many docs, e.g. by making BulkScorer score ranges 
of X doc IDs at once and only calling prefetch once per range.
   
   Yes, I was thinking that this `prefetch` API could take in a bitset of doc IDs.
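
   As a rough illustration of why a bitset argument helps amortize the per-doc 
overhead, here is a minimal sketch that turns the set bits into a few contiguous 
byte ranges. It assumes a dense, fixed-width encoding where doc `d` lives at 
`d * bytesPerValue`, which real doc values formats do not guarantee. The class 
`DocIdRanges` and its method are hypothetical and not part of Lucene, and it uses 
`java.util.BitSet` rather than Lucene's own bitset types.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper, not a Lucene API: groups matching doc IDs into byte ranges. */
final class DocIdRanges {
    private DocIdRanges() {}

    /**
     * Converts runs of consecutive doc IDs into {offset, length} pairs.
     * Assumption: values are dense and fixed-width, so doc d starts at d * bytesPerValue.
     */
    static List<long[]> toByteRanges(BitSet docIds, int bytesPerValue) {
        if (bytesPerValue <= 0) {
            throw new IllegalArgumentException("bytesPerValue must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        int start = docIds.nextSetBit(0);
        while (start >= 0) {
            // First doc ID after the current run of set bits (exclusive end).
            int end = docIds.nextClearBit(start);
            long offset = (long) start * bytesPerValue;
            long length = (long) (end - start) * bytesPerValue;
            ranges.add(new long[] {offset, length});
            start = docIds.nextSetBit(end);
        }
        return ranges;
    }
}
```

   With docs 3, 4, 5 and 10 set and 8 bytes per value, this yields two ranges 
instead of four per-doc calls, so a caller would only need to issue one prefetch 
per range.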
   
   > Maybe such an approach would be ok for application code that can make 
assumptions about how much page cache it has, but I'm expecting Lucene code to 
avoid ever prefetching many MBs at once, because this increases chances that 
the first bytes that got prefetched got paged out before we could use them. 
This is one reason why I like the approach of just giving a hint to the 
IndexInput that it should perform read-ahead, IndexInput impls that read from 
fast storage can read ahead relatively little, in the order of a few pages, 
while IndexInput impls that read from slower storage like the warm index 
use-case that @sohami describes above could fetch MBs from slow remote storage 
and cache it on a local disk or something like that to reduce interactions with 
the slow remote storage.
   
   If I understand correctly, the read-ahead mechanism in `IndexInput` is only 
useful when the matching docs fall within the read-ahead size. Otherwise the 
pages that were cached (or downloaded, in the warm index use-case) are wasted 
and the prefetch does not help. Instead, I was thinking that we could have a 
bulk prefetch API at the `IndexInput` layer too (in addition to the one on 
DocValues) that takes, say, a list of offsets and maybe lengths as well. Each 
`IndexInput` could then decide internally whether to limit the prefetch or to 
perform it for every provided input. One way to limit the `prefetch` would be 
to work out from the `offsets` how many distinct pages need to be prefetched 
and cap that number at a threshold chosen by the `IndexInput` implementation.
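
   To make the page-limiting idea concrete, here is a minimal sketch of how an 
implementation might map the requested offsets and lengths to distinct pages and 
cap them at a threshold. `PrefetchPlanner`, `pagesToPrefetch`, `pageSize` and 
`maxPages` are hypothetical names for illustration and not part of Lucene. 
Keeping the lowest-numbered pages when the cap is hit is just one possible 
policy; an implementation could equally skip the prefetch altogether.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper, not a Lucene API: picks which pages a bulk prefetch touches. */
final class PrefetchPlanner {
    private PrefetchPlanner() {}

    /**
     * Maps each (offset, length) range to the pages it covers and returns the
     * distinct page indexes in ascending order, keeping at most maxPages of them.
     */
    static List<Long> pagesToPrefetch(long[] offsets, long[] lengths, int pageSize, int maxPages) {
        if (offsets.length != lengths.length) {
            throw new IllegalArgumentException("offsets and lengths must have the same size");
        }
        if (pageSize <= 0 || maxPages < 0) {
            throw new IllegalArgumentException("pageSize must be > 0 and maxPages >= 0");
        }
        // TreeSet de-duplicates pages shared by several ranges and keeps them sorted.
        TreeSet<Long> pages = new TreeSet<>();
        for (int i = 0; i < offsets.length; i++) {
            if (lengths[i] <= 0) {
                continue; // nothing to read for this entry
            }
            long first = offsets[i] / pageSize;
            long last = (offsets[i] + lengths[i] - 1) / pageSize;
            for (long page = first; page <= last; page++) {
                pages.add(page);
            }
        }
        // Apply the threshold chosen by the (hypothetical) IndexInput implementation.
        List<Long> limited = new ArrayList<>();
        for (long page : pages) {
            if (limited.size() == maxPages) {
                break;
            }
            limited.add(page);
        }
        return limited;
    }
}
```

   For example, with 4096-byte pages, four 8-byte reads at offsets 0, 100, 5000 
and 9000 touch only three distinct pages (0, 1 and 2), and a threshold of two 
would restrict the prefetch to pages 0 and 1.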
   
   > If one of you would like to take a stab at an approach to prefetching doc 
values, I'd be happy to look at a PR.
   
   Sure, I can take a stab at it, starting with `NumericDocValues` in the 
context of facets. I will have to explore how facets work in Lucene, so it may 
take me some time. Let me know if that sounds good to you as a starting PoC.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

