[I] [C++][Dataset] Change scanner readahead limits to be based on bytes instead of number of batches [arrow]

via GitHub Fri, 26 Jun 2026 02:48:47 -0700


asfimport opened a new issue, #30191:
URL: https://github.com/apache/arrow/issues/30191


   In the scanner, readahead is controlled by "batch_readahead" and 
"fragment_readahead" (both specified in the scan options).  This was mainly 
motivated by my work with CSV, and the defaults of 32 and 8 will cause the 
scanner to buffer ~256MB of data (given the default block size of 1MB).
   
   
   For Parquet / IPC this would mean we are buffering 256 row groups, which is 
entirely too high.
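
The arithmetic behind these figures can be sketched as follows. This is only an illustration: `WorstCaseBufferedBytes` and `kMiB` are hypothetical names, not part of the Arrow API, and the 64 MiB row group size used in the comment is an assumed figure, not one taken from this issue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant for readability (not an Arrow symbol).
constexpr std::int64_t kMiB = 1024 * 1024;

// Hypothetical helper (not an Arrow API): upper bound on buffered bytes when
// readahead is counted in batches per fragment and fragments in flight.
inline std::int64_t WorstCaseBufferedBytes(std::int64_t batch_readahead,
                                           std::int64_t fragment_readahead,
                                           std::int64_t bytes_per_batch) {
  assert(batch_readahead >= 0 && fragment_readahead >= 0 && bytes_per_batch >= 0);
  // Each in-flight fragment may hold up to batch_readahead batches at once.
  return batch_readahead * fragment_readahead * bytes_per_batch;
}

// CSV with the defaults quoted above (32 and 8) and 1 MiB blocks:
//   WorstCaseBufferedBytes(32, 8, 1 * kMiB)  == 256 * kMiB
// The same counts applied to row groups of an assumed 64 MiB each:
//   WorstCaseBufferedBytes(32, 8, 64 * kMiB) == 16 GiB
```

The same two counts that are tolerable for 1 MiB CSV blocks grow linearly with the unit size, which is why a batch-count limit behaves so differently across formats.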
   
   Rather than make users figure out complex parameters, we should have a single 
readahead limit that is specified in bytes.
   
   This will be "best effort".  I'm not suggesting we support partial reads of row 
groups / record batches, so if the limit is set very small we still might end up 
with more in RAM just because we can only load entire row groups.
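
One way to picture the "best effort" behaviour is a byte budget that only ever admits whole batches. The sketch below is an assumption-laden illustration, not the scanner's actual design: `ReadaheadBudget`, `TryAcquire`, and `Release` are hypothetical names, and the single-threaded bookkeeping ignores the synchronization a real scanner would need.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (not Arrow's implementation): tracks bytes currently
// buffered and decides whether another whole batch / row group may be read.
class ReadaheadBudget {
 public:
  explicit ReadaheadBudget(std::int64_t limit_bytes) : limit_bytes_(limit_bytes) {
    assert(limit_bytes >= 0);
  }

  // Admit the batch if it fits under the limit. If nothing is buffered, admit
  // it even when it is larger than the limit, because partial reads are not
  // supported and the scan must still make progress.
  bool TryAcquire(std::int64_t batch_bytes) {
    assert(batch_bytes >= 0);
    const bool fits = buffered_bytes_ + batch_bytes <= limit_bytes_;
    if (!fits && buffered_bytes_ > 0) return false;
    buffered_bytes_ += batch_bytes;
    return true;
  }

  // Called once a batch has been consumed downstream.
  void Release(std::int64_t batch_bytes) {
    assert(batch_bytes >= 0 && batch_bytes <= buffered_bytes_);
    buffered_bytes_ -= batch_bytes;
  }

  std::int64_t buffered_bytes() const { return buffered_bytes_; }

 private:
  std::int64_t limit_bytes_;
  std::int64_t buffered_bytes_ = 0;
};
```

With a limit of 10 bytes, an empty budget still admits a 64-byte row group, so RAM usage exceeds the limit, but no further batch is admitted until that row group is released. That matches the caveat above about very small limits.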
   
   **Reporter**: [Weston 
Pace](https://issues.apache.org/jira/browse/ARROW-14648) / @westonpace
   #### Related issues:
   - [[C++] Change dataset readahead to be based on available RAM/CPU instead 
of fixed constants/options](https://github.com/apache/arrow/issues/27859) (is 
duplicated by)
   - [[C++] Improve performance of parquet 
readahead](https://github.com/apache/arrow/issues/31683) (is related to)
   - [[C++][R]Opening a multi-file dataset and writing a re-partitioned version 
of it fails](https://github.com/apache/arrow/issues/18944) (is depended upon by)
   - [[C++][Datasets] Improve memory usage of 
datasets](https://github.com/apache/arrow/issues/30893) (is depended upon by)
   
   <sub>**Note**: *This issue was originally created as 
[ARROW-14648](https://issues.apache.org/jira/browse/ARROW-14648). Please see 
the [migration documentation](https://github.com/apache/arrow/issues/14542) for 
further details.*</sub>


