westonpace commented on issue #13403: URL: https://github.com/apache/arrow/issues/13403#issuecomment-1205693533
S3 recommends [one concurrent read for every 85-90MB/s of bandwidth](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html). If your aim is to saturate that 2500MB/s link then you probably want ~30 concurrent reads. In Arrow, the factors that control your concurrent reads are `ScanOptions::fragment_readahead`, `ScanOptions::batch_readahead`, `ScanOptions::batch_size`, the size of your row groups, and the size of the I/O thread pool.

Some things to try (a combined configuration sketch follows this list):

* Are you reading all 340 columns? If not, you could probably set a larger batch size. I do most of my testing around 1 million rows / 20 columns: `scanner_builder->BatchSize(60 * 1024);`
* What size row groups do you have in your files? Arrow currently reads only one entire row group at a time, so if each file has a single row group you will need at least 30 files to achieve max throughput (though this might use a lot of RAM if the row groups are large). The metadata snippet below shows one way to check this.
* You might want to increase the fragment readahead. It defaults to 4 files in Arrow 8.0.
* You probably want to increase the size of the I/O thread pool. It defaults to 8 threads:
  ```
  arrow::io::SetIOThreadPoolCapacity(32);
  ```

Do any of these configuration changes improve performance?
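To make those settings concrete, here is a minimal sketch of a scan configured for more concurrency. It assumes you already have a `std::shared_ptr<arrow::dataset::Dataset>` (for example, built from a `FileSystemDatasetFactory` over the S3 bucket); the function name, the readahead value of 16, and the pool size of 32 are illustrative, not tuned recommendations. If your Arrow version does not expose `ScannerBuilder::FragmentReadahead`, the equivalent `ScanOptions::fragment_readahead` field can be set instead.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/io/api.h>

#include <memory>

// Sketch only: configure a Dataset scan for higher S3 concurrency.
// `dataset` is assumed to already point at the S3 files.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithMoreConcurrency(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  // Grow the global I/O thread pool from the default of 8 so more
  // concurrent S3 reads can be in flight at once.
  ARROW_RETURN_NOT_OK(arrow::io::SetIOThreadPoolCapacity(32));

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Larger batches mean fewer, larger reads per request.
  ARROW_RETURN_NOT_OK(builder->BatchSize(60 * 1024));
  // Read ahead more files at once (ScanOptions::fragment_readahead,
  // which defaults to 4 in Arrow 8.0).
  ARROW_RETURN_NOT_OK(builder->FragmentReadahead(16));
  ARROW_RETURN_NOT_OK(builder->UseThreads(true));

  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```

The idea is simply that with more row groups in flight across a 32-thread I/O pool, the scan has a realistic chance of keeping ~30 S3 reads outstanding at once.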
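And here is a quick way to check the row-group layout of one of your files. The helper name is made up and it opens a local path for simplicity; for objects on S3 you could open them through an `S3FileSystem` instead.

```cpp
#include <parquet/api/reader.h>

#include <iostream>
#include <memory>
#include <string>

// Sketch: print the row-group layout of a Parquet file so you can judge
// how many concurrent row-group reads are possible per file.
void PrintRowGroupLayout(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  std::cout << path << ": " << metadata->num_row_groups() << " row group(s)\n";
  for (int i = 0; i < metadata->num_row_groups(); ++i) {
    auto rg = metadata->RowGroup(i);
    std::cout << "  row group " << i << ": " << rg->num_rows() << " rows, "
              << rg->total_byte_size() << " bytes (uncompressed)\n";
  }
}
```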
