Weston Pace created ARROW-14648:
-----------------------------------
Summary: [C++][Dataset] Change scanner readahead limits to be
based on bytes instead of number of batches
Key: ARROW-14648
URL: https://issues.apache.org/jira/browse/ARROW-14648
Project: Apache Arrow
Issue Type: Improvement
Reporter: Weston Pace
In the scanner, readahead is controlled by "batch_readahead" and
"fragment_readahead" (both specified in the scan options). This was mainly
motivated by my work with CSV, where the defaults of 32 and 8 cause the
scanner to buffer ~256MB of data (given the default block size of 1MB).
For Parquet / IPC this would mean we are buffering 256 row groups, which is
far too many.
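The ~256MB figure above follows directly from multiplying the two defaults
by the block size. A quick back-of-the-envelope sketch (the variable names
are illustrative, not Arrow API names):

```python
# Readahead buffering math for the current defaults quoted above.
batch_readahead = 32      # batches read ahead per fragment (default)
fragment_readahead = 8    # fragments read ahead concurrently (default)
block_size_mb = 1         # default CSV block size, in MB

# Worst case: every fragment has its full batch readahead in flight.
buffered_mb = batch_readahead * fragment_readahead * block_size_mb
print(buffered_mb)  # → 256
```

The same arithmetic explains the Parquet / IPC problem: the batch count
stays fixed at 256, but each "batch" is a whole row group, which is
typically far larger than 1MB.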
Rather than make users figure out complex parameters, we should have a
single readahead limit specified in bytes.
This will be "best effort". I'm not suggesting we support partial reads of
row groups / record batches, so if the limit is set very small we may still
end up with more in RAM simply because we can only load entire row groups.
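A hypothetical sketch of those "best effort" semantics (this is not the
proposed implementation; the function and parameter names are invented for
illustration). Row groups are admitted whole, and at least one is always
admitted, so a very small limit can still overshoot by up to one row group:

```python
def plan_readahead(row_group_sizes, limit_bytes):
    """Return (count, bytes) of leading row groups to buffer.

    Best effort: row groups are loaded whole, and at least one is
    always admitted even if it alone exceeds limit_bytes.
    """
    buffered = 0
    count = 0
    for size in row_group_sizes:
        # Stop before exceeding the budget, but never refuse the
        # first row group (partial reads are not supported).
        if count > 0 and buffered + size > limit_bytes:
            break
        buffered += size
        count += 1
    return count, buffered

# A 10-byte limit against 48MB row groups still buffers one whole
# group: the limit is a target, not a hard cap.
print(plan_readahead([48, 48, 48], 10))   # → (1, 48)
print(plan_readahead([48, 48, 48], 100))  # → (2, 96)
```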