[I] [Parquet] Avoid fetching multiple pages when `max_predicate_cache_size` is 0 [arrow-rs]

via GitHub Thu, 02 Oct 2025 14:30:51 -0700


nuno-faria opened a new issue, #8542:
URL: https://github.com/apache/arrow-rs/issues/8542


   **Describe the bug**
   
   #7850 introduced the cached array reader, which causes multiple data pages 
to be fetched if their size is less than the `batch_size`. Depending on the 
file, workload, and batch size, this might cause regressions in performance 
(e.g., https://github.com/apache/datafusion/issues/17575).
   
   While setting `max_predicate_cache_size` to 0 essentially disables the 
predicate cache, multiple pages are still unnecessarily retrieved.
   
   The sources of this issue are:
    - The `InMemoryRowGroup::fetch` will use the expanded selection based on 
the provided cache `ProjectionMask`.
    - The `ArrayReaderBuilder::build_reader` will use a `CachedArrayReader` 
instead of the regular reader, also based on the cache `ProjectionMask`.
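
To illustrate why an expanded selection pulls in extra pages, here is a toy
model in plain Rust. It is not the arrow-rs implementation: the helper names
`expand_to_batch_boundaries` and `pages_to_fetch`, the fixed page size, and
the rounding rule are all assumptions made for the example. It assumes that
expanding a selection means rounding each selected row range outward to
multiples of `batch_size`.

```rust
use std::ops::Range;

/// Toy model (assumption): round each selected row range outward to
/// multiples of `batch_size`, clamped to `total_rows`.
fn expand_to_batch_boundaries(
    selected: &[Range<usize>],
    batch_size: usize,
    total_rows: usize,
) -> Vec<Range<usize>> {
    selected
        .iter()
        .map(|r| {
            let start = (r.start / batch_size) * batch_size;
            let end = ((r.end + batch_size - 1) / batch_size) * batch_size;
            start..end.min(total_rows)
        })
        .collect()
}

/// Indices of the pages that overlap at least one selected row range.
/// Every page is assumed to hold `page_rows` rows (the last may be shorter).
fn pages_to_fetch(page_rows: usize, total_rows: usize, selected: &[Range<usize>]) -> Vec<usize> {
    let num_pages = (total_rows + page_rows - 1) / page_rows;
    (0..num_pages)
        .filter(|&p| {
            let page_start = p * page_rows;
            let page_end = ((p + 1) * page_rows).min(total_rows);
            selected
                .iter()
                .any(|r| r.start < page_end && page_start < r.end)
        })
        .collect()
}

fn main() {
    // 10_000 rows in pages of 1_000 rows; the filter selects 100 rows.
    let selected = vec![2_500..2_600];

    // Without expansion only the page holding the selected rows is needed.
    let minimal = pages_to_fetch(1_000, 10_000, &selected);

    // With expansion to a batch size of 8_192 rows, nine pages are touched.
    let expanded = expand_to_batch_boundaries(&selected, 8_192, 10_000);
    let widened = pages_to_fetch(1_000, 10_000, &expanded);

    println!("minimal: {minimal:?}, widened: {widened:?}");
}
```

In this model the extra pages appear only because the pages hold fewer rows
than `batch_size`, which matches the condition described above.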
   
   Thus the most straightforward solution I've found is to return `None` from 
`ReaderFactory::compute_cache_projection` if `max_predicate_cache_size` is 
0, causing the reader to fetch only the necessary pages:
   
   ```diff
       fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
   +       if self.max_predicate_cache_size == 0 {
   +           return None;
   +       }
          ...
       }
   ```
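
For readers without the arrow-rs source at hand, the guard can be sketched as
a standalone program. The `ProjectionMask` and `ReaderFactory` types below are
simplified stand-ins, not the real parquet types, and the `filter_columns`
field and the logic after the guard are placeholders for the part elided in
the diff above.

```rust
/// Stand-in for the parquet `ProjectionMask`: just a list of leaf column indices.
#[derive(Debug, Clone, PartialEq)]
struct ProjectionMask {
    leaves: Vec<usize>,
}

/// Simplified stand-in for the crate-internal `ReaderFactory`.
struct ReaderFactory {
    max_predicate_cache_size: usize,
    /// Hypothetical field: leaf columns read by the filter predicates.
    filter_columns: Vec<usize>,
}

impl ReaderFactory {
    fn compute_cache_projection(&self, projection: &ProjectionMask) -> Option<ProjectionMask> {
        // Proposed guard: with a zero-sized cache nothing can be cached, so
        // report "no cache projection" before doing any other work.
        if self.max_predicate_cache_size == 0 {
            return None;
        }

        // Placeholder for the elided logic (assumption): cache the columns
        // that are used by the filters and are also part of the output.
        let leaves: Vec<usize> = self
            .filter_columns
            .iter()
            .copied()
            .filter(|c| projection.leaves.contains(c))
            .collect();

        if leaves.is_empty() {
            None
        } else {
            Some(ProjectionMask { leaves })
        }
    }
}

fn main() {
    let projection = ProjectionMask { leaves: vec![0, 1, 2] };

    let disabled = ReaderFactory { max_predicate_cache_size: 0, filter_columns: vec![1] };
    assert_eq!(disabled.compute_cache_projection(&projection), None);

    let enabled = ReaderFactory { max_predicate_cache_size: 1024, filter_columns: vec![1] };
    assert_eq!(
        enabled.compute_cache_projection(&projection),
        Some(ProjectionMask { leaves: vec![1] })
    );
}
```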
   
   `ReaderFactory::read_row_group` remains the same, since it already 
handles the case where `compute_cache_projection` returns `None`:
   
   ```rust
           let cache_projection = match self.compute_cache_projection(&projection) {
               Some(projection) => projection,
               None => ProjectionMask::none(meta.columns().len()),
           };
   ```
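
The effect of that fallback can be sketched in isolation. The `ProjectionMask`
below is a stand-in that only mimics `none` and `leaf_included`, and
`resolve_cache_projection` is a hypothetical helper name. The point is that a
`None` result becomes a mask selecting no columns, so no column is treated as
cached.

```rust
/// Stand-in for the parquet `ProjectionMask`: one flag per leaf column.
#[derive(Debug, Clone, PartialEq)]
struct ProjectionMask {
    mask: Vec<bool>,
}

impl ProjectionMask {
    /// A mask over `num_columns` leaf columns that selects none of them.
    fn none(num_columns: usize) -> Self {
        Self { mask: vec![false; num_columns] }
    }

    /// Whether the leaf column at `leaf_idx` is selected.
    fn leaf_included(&self, leaf_idx: usize) -> bool {
        self.mask[leaf_idx]
    }
}

/// Hypothetical helper mirroring the `match` above: fall back to an empty
/// mask when no cache projection was computed.
fn resolve_cache_projection(computed: Option<ProjectionMask>, num_columns: usize) -> ProjectionMask {
    computed.unwrap_or_else(|| ProjectionMask::none(num_columns))
}

fn main() {
    // With `None`, no column of a three-column file is selected for caching.
    let fallback = resolve_cache_projection(None, 3);
    assert!((0..3).all(|i| !fallback.leaf_included(i)));

    // A computed projection is passed through unchanged.
    let computed = ProjectionMask { mask: vec![false, true, false] };
    assert_eq!(resolve_cache_projection(Some(computed.clone()), 3), computed);
}
```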
   
   @alamb @XiangpengHao what do you think? Is this the best way to solve the 
issue? If so I can open a PR.
   
   **To Reproduce**
   
   See https://github.com/apache/datafusion/issues/17575.
   
   **Expected behavior**
   
   Retrieve only the minimum required data pages if `max_predicate_cache_size` 
is set to 0.
   
   **Additional context**

