[
https://issues.apache.org/jira/browse/ARROW-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ziheng Wang updated ARROW-17299:
--------------------------------
Summary: [C++] [Python] Expose the Scanner kDefaultBatchReadahead and
kDefaultFragmentReadahead parameters (was: Expose the Scanner
kDefaultBatchReadahead and kDefaultFragmentReadahead parameters)
> [C++] [Python] Expose the Scanner kDefaultBatchReadahead and
> kDefaultFragmentReadahead parameters
> -------------------------------------------------------------------------------------------------
>
> Key: ARROW-17299
> URL: https://issues.apache.org/jira/browse/ARROW-17299
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Ziheng Wang
> Assignee: Ziheng Wang
> Priority: Major
>
> The Scanner has two parameters, kDefaultFragmentReadahead and
> kDefaultBatchReadahead, that are currently hard-coded and cannot be changed
> by the caller.
> This is a problem because tuning these numbers is the key to trading off RAM
> usage against network I/O utilization during reading. For example, on an
> i3.2xlarge instance on AWS you can reach peak throughput only by quadrupling
> kDefaultFragmentReadahead from the default.
> The current settings are very conservative and assume a < 1 Gbps network.
> Exposing them would allow people to tune the Scanner's behavior to their own
> hardware.
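
To make the RAM side of this tradeoff concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not Arrow code. It assumes a simplified model in which each fragment being read ahead can hold up to the batch-readahead number of batches in memory at once; the function name, the readahead values, and the 8 MiB batch size are all hypothetical and chosen only for illustration.

```python
def max_buffered_bytes(fragment_readahead, batch_readahead, batch_bytes):
    """Rough upper bound on bytes held in readahead buffers.

    Simplified model (an assumption, not Arrow's actual accounting):
    every fragment being read ahead may buffer up to `batch_readahead`
    batches of `batch_bytes` each.
    """
    if min(fragment_readahead, batch_readahead, batch_bytes) < 0:
        raise ValueError("all arguments must be non-negative")
    return fragment_readahead * batch_readahead * batch_bytes


MIB = 1024 ** 2

# Hypothetical baseline settings and batch size, for illustration only.
baseline = max_buffered_bytes(
    fragment_readahead=4, batch_readahead=16, batch_bytes=8 * MIB
)

# Quadrupling the fragment readahead, as in the i3.2xlarge example above.
quadrupled = max_buffered_bytes(
    fragment_readahead=16, batch_readahead=16, batch_bytes=8 * MIB
)

print(f"baseline:   {baseline // MIB} MiB")
print(f"quadrupled: {quadrupled // MIB} MiB")
```

Under this model the buffered-memory bound grows linearly with the fragment readahead, so quadrupling it quadruples the worst-case RAM held in flight. That cost is why the right value depends on the machine's memory and network bandwidth.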
--
This message was sent by Atlassian Jira
(v8.20.10#820010)