nuno-faria opened a new issue, #8542: URL: https://github.com/apache/arrow-rs/issues/8542
**Describe the bug**

#7850 introduced the cached array reader, which causes multiple data pages to be fetched if their size is less than the `batch_size`. Depending on the file, workload, and batch size, this might cause performance regressions (e.g., https://github.com/apache/datafusion/issues/17575). While setting `max_predicate_cache_size` to 0 essentially disables the predicate cache, multiple pages are still unnecessarily retrieved.

The sources of this issue are:
- `InMemoryRowGroup::fetch` uses the expanded selection based on the provided cache `ProjectionMask`.
- `ArrayReaderBuilder::build_reader` uses a `CachedArrayReader` instead of the regular reader, also based on the cache `ProjectionMask`.

Thus the most straightforward solution I've found is to return `None` from `ReaderFactory::compute_cache_projection` if `max_predicate_cache_size` is 0, causing the reader to fetch only the necessary pages:

```diff
 fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
+    if self.max_predicate_cache_size == 0 {
+        return None;
+    }
     ...
 }
```

`ReaderFactory::read_row_group` remains the same, since it already handles the possibility of `compute_cache_projection` returning `None`:

```rust
let cache_projection = match self.compute_cache_projection(&projection) {
    Some(projection) => projection,
    None => ProjectionMask::none(meta.columns().len()),
};
```

@alamb @XiangpengHao what do you think? Is this the best way to solve the issue? If so, I can open a PR.

**To Reproduce**

See https://github.com/apache/datafusion/issues/17575.

**Expected behavior**

Retrieve only the minimum required data pages if `max_predicate_cache_size` is set to 0.

**Additional context**

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
