sohami commented on issue #13179: URL: https://github.com/apache/lucene/issues/13179#issuecomment-2138876791
> To avoid this per-doc overhead, I imagine that we would need to add some prefetch() API on (Numeric|SortedNumeric|Sorted|SortedSet|Binary)DocValues like @sohami suggests and require it to be called at a higher level when this can be more easily amortized across many docs, e.g. by making BulkScorer score ranges of X doc IDs at once and only calling prefetch once per range.

Yes, I was thinking that this `prefetch` API could take a bitset of doc IDs.

> Maybe such an approach would be ok for application code that can make assumptions about how much page cache it has, but I'm expecting Lucene code to avoid ever prefetching many MBs at once, because this increases chances that the first bytes that got prefetched got paged out before we could use them. This is one reason why I like the approach of just giving a hint to the IndexInput that it should perform read-ahead, IndexInput impls that read from fast storage can read ahead relatively little, in the order of a few pages, while IndexInput impls that read from slower storage like the warm index use-case that @sohami describes above could fetch MBs from slow remote storage and cache it on a local disk or something like that to reduce interactions with the slow remote storage.

If I understand correctly, the read-ahead mechanism in `IndexInput` will be useful only if the matching docs fall within the read-ahead size. Otherwise the pages that were cached (or downloaded, in the warm index use-case) are wasted and the prefetch is not useful. Instead, I was thinking that if we also had a bulk prefetch API at the `IndexInput` layer (in addition to the one on the DocValues) that takes, say, a list of offsets and maybe lengths as well, then each `IndexInput` could decide internally whether to limit the prefetch or to perform it for each of the provided inputs?

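To make the bulk prefetch idea above a bit more concrete, here is a rough Java sketch of the planning step an `IndexInput` implementation could run internally. It is only an illustration under assumptions: the class `BulkPrefetchSketch`, the method `plan`, and the `pageSize` / `maxPages` parameters are all hypothetical names, not existing Lucene API, and keeping the lowest-numbered pages when the cap is hit is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch, not Lucene API: plans which pages a bulk prefetch would touch. */
final class BulkPrefetchSketch {

  /**
   * Maps file offsets to distinct pages, keeps at most maxPages of them
   * (lowest page numbers first, an assumed policy), and merges adjacent
   * pages into ranges. Each returned element is {startOffset, lengthInBytes}.
   */
  static List<long[]> plan(long[] offsets, int pageSize, int maxPages) {
    // Deduplicate and sort the page indexes touched by the offsets.
    TreeSet<Long> pages = new TreeSet<>();
    for (long offset : offsets) {
      pages.add(offset / pageSize);
    }

    List<long[]> ranges = new ArrayList<>();
    long runStart = -1; // first page of the current run, -1 if no run is open
    long runEnd = -1;   // last page of the current run (inclusive)
    int taken = 0;      // number of distinct pages accepted so far

    for (long page : pages) {
      if (taken == maxPages) {
        break; // threshold reached: ignore the remaining pages
      }
      if (runStart >= 0 && page == runEnd + 1) {
        runEnd = page; // extends the current run of adjacent pages
      } else {
        if (runStart >= 0) {
          ranges.add(toRange(runStart, runEnd, pageSize));
        }
        runStart = page;
        runEnd = page;
      }
      taken++;
    }
    if (runStart >= 0) {
      ranges.add(toRange(runStart, runEnd, pageSize));
    }
    return ranges;
  }

  private static long[] toRange(long firstPage, long lastPage, int pageSize) {
    return new long[] {firstPage * pageSize, (lastPage - firstPage + 1) * pageSize};
  }
}
```

With 4096-byte pages, the offsets 10, 5000, 4100 and 20000 fall on pages 0, 1, 1 and 4, so a cap of 8 pages yields two ranges (pages 0 to 1, and page 4), while a cap of 2 pages yields only the first one. An implementation on fast storage could pick a small cap, and one backed by remote storage a much larger one.
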
One mechanism to limit the `prefetch` could be to work out from the `offsets` how many distinct pages need to be prefetched, and to cap that number at a threshold that each `IndexInput` implementation decides on.

> If one of you would like to take a stab at an approach to prefetching doc values, I'd be happy to look at a PR.

Sure, I can take a stab at it, starting with `NumericDocValues` in the context of facets. I will have to explore how facets work in Lucene, so it may take me some time. Let me know if that sounds good to you as a starting PoC.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org