[I] [Parquet] PushDecoder: Add a peek API to support pre-fetching [arrow-rs]

via GitHub Mon, 20 Oct 2025 13:00:49 -0700


alamb opened a new issue, #8668:
URL: https://github.com/apache/arrow-rs/issues/8668


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   - part of https://github.com/apache/arrow-rs/issues/8000
   
   Unlike streams of JSON / CSV, the data that the parquet reader needs next 
is not easy to predict, as it depends on the filters, the row groups, which 
columns are requested, etc.
   
   Now that we have the initial PushDecoder in this PR
   - https://github.com/apache/arrow-rs/pull/7997
   
   we are in a position to add an API for the decoder to communicate what 
data it will need next.
   
   **Describe the solution you'd like**
   I would like an API that gives users of the Parquet decoder more 
fine-grained control over peeking at the data it will need next.
   
   **Describe alternatives you've considered**
   
   Here is an idea from @adriangb on 
https://github.com/apache/arrow-rs/pull/7997/files#r2444922393
   
   > a method along the lines of try_peek()? It'd be cool if it returned some 
structure that allowed fine grained control of the peeking:
   
   ```rust
   let max_ranges = 32;
   let max_bytes = 1024 * 1024 * 32;
   let mut current_bytes = 0;
   let mut ranges = Vec::new();
   let mut peek = decoder.peek();
   loop {
       match peek.next() {
           PeekResult::Range(range) => {
               // count the bytes before `range` is moved into the Vec
               current_bytes += range.end - range.start;
               ranges.push(range);
               // stop once either budget is reached
               if ranges.len() >= max_ranges { break }
               if current_bytes >= max_bytes { break }
           }
           PeekResult::End => break,
       }
   }
   ```
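   
   For illustration only, here is a self-contained sketch of that loop that 
compiles on its own. `PeekResult`, `MockPeek`, and `collect_prefetch_ranges` 
are hypothetical names; none of them exist in the parquet crate, and the mock 
simply replays a fixed list of byte ranges in place of a real decoder.
   
   ```rust
   use std::ops::Range;

   /// Hypothetical result type (assumption, not an existing parquet API).
   #[derive(Debug, PartialEq)]
   enum PeekResult {
       Range(Range<u64>),
       End,
   }

   /// Hypothetical stand-in for whatever `decoder.peek()` might return.
   /// It replays a fixed list of byte ranges instead of asking a decoder.
   struct MockPeek {
       upcoming: std::vec::IntoIter<Range<u64>>,
   }

   impl MockPeek {
       fn new(ranges: Vec<Range<u64>>) -> Self {
           Self { upcoming: ranges.into_iter() }
       }

       fn next(&mut self) -> PeekResult {
           match self.upcoming.next() {
               Some(range) => PeekResult::Range(range),
               None => PeekResult::End,
           }
       }
   }

   /// Collect ranges to prefetch until the range-count budget or the
   /// byte budget is reached, or the peek source runs out.
   fn collect_prefetch_ranges(
       peek: &mut MockPeek,
       max_ranges: usize,
       max_bytes: u64,
   ) -> Vec<Range<u64>> {
       let mut ranges = Vec::new();
       let mut current_bytes = 0u64;
       while ranges.len() < max_ranges && current_bytes < max_bytes {
           match peek.next() {
               PeekResult::Range(range) => {
                   // count the bytes before `range` is moved into the Vec
                   current_bytes += range.end - range.start;
                   ranges.push(range);
               }
               PeekResult::End => break,
           }
       }
       ranges
   }

   fn main() {
       let mut peek = MockPeek::new(vec![0..100, 100..400, 400..1000, 1000..1200]);
       // With a 350 byte budget the loop stops after the second range,
       // since the running total (400 bytes) has then reached the budget.
       let ranges = collect_prefetch_ranges(&mut peek, 32, 350);
       println!("{:?}", ranges); // prints [0..100, 100..400]
   }
   ```
   
   The budgets are checked before each call to `next()`, so the caller never 
pulls more ranges from the peek structure than it intends to prefetch; the 
last range accepted may still push the byte total past `max_bytes`.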
   
   Here is another potential API from the original ticket:
   ```rust
   
   // Create a decoder for decoding parquet data as above
   let mut decoder: ParquetDecoderBuilder = ...;
   
   // As the decoder figures out what data it will need, start prefetching
   // data if desired
   while let Some(pre_request) = decoder.peek_next_requests() {
       // note that this is a peek and if we call peek again in the
       // future, we may get a different set of pre_requests (for example
       // if the decoder has applied a row filter and ruled out
       // some row groups or data pages)
       start_prefetch(pre_request);
   }
   
   // push data to the decoder as before, but hopefully the reader
   // will have already prefetched some of the data
   ```
   
   **Additional context**
   


