westonpace commented on issue #13403: URL: https://github.com/apache/arrow/issues/13403#issuecomment-1205693533
S3 recommends [one concurrent read for every 85-90MB/s of bandwidth](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html). If your aim is to saturate that 2500MB/s link then you probably want ~30 concurrent reads. In Arrow, the factors that control your concurrent reads are `ScanOptions::fragment_readahead`, `ScanOptions::batch_readahead`, `ScanOptions::batch_size`, the size of your row groups, and the size of the I/O thread pool.

Some things to try (a combined configuration sketch follows this list):

* Are you reading all 340 columns? If not, you could probably set a larger batch size. I do most of my testing around 1 million rows / 20 columns: `scanner_builder->BatchSize(60 * 1024);`
* What size row groups do you have in your files? Arrow currently reads only one entire row group at a time, so if each file has a single row group you will need at least 30 files to achieve max throughput (though this might use a lot of RAM if the row groups are large). The metadata snippet below shows one way to check this.
* You might want to increase the fragment readahead. It defaults to 4 files in Arrow 8.0.
* You probably want to increase the size of the I/O thread pool. It defaults to 8 threads:
  ```
  arrow::io::SetIOThreadPoolCapacity(32);
  ```

Do any of these configuration changes improve performance?
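To make those settings concrete, here is a minimal sketch of a scan configured for more concurrency. It assumes you already have a `std::shared_ptr<arrow::dataset::Dataset>` (for example, built from a `FileSystemDatasetFactory` over the S3 bucket); the function name, the readahead value of 16, and the pool size of 32 are illustrative, not tuned recommendations. If your Arrow version does not expose `ScannerBuilder::FragmentReadahead`, the equivalent `ScanOptions::fragment_readahead` field can be set instead.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/io/api.h>

#include <memory>

// Sketch only: configure a Dataset scan for higher S3 concurrency.
// `dataset` is assumed to already point at the S3 files.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithMoreConcurrency(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  // Grow the global I/O thread pool from the default of 8 so more
  // concurrent S3 reads can be in flight at once.
  ARROW_RETURN_NOT_OK(arrow::io::SetIOThreadPoolCapacity(32));

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Larger batches mean fewer, larger reads per request.
  ARROW_RETURN_NOT_OK(builder->BatchSize(60 * 1024));
  // Read ahead more files at once (ScanOptions::fragment_readahead,
  // which defaults to 4 in Arrow 8.0).
  ARROW_RETURN_NOT_OK(builder->FragmentReadahead(16));
  ARROW_RETURN_NOT_OK(builder->UseThreads(true));

  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```

The idea is simply that with more row groups in flight across a 32-thread I/O pool, the scan has a realistic chance of keeping ~30 S3 reads outstanding at once.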
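And here is a quick way to check the row-group layout of one of your files. The helper name is made up and it opens a local path for simplicity; for objects on S3 you could open them through an `S3FileSystem` instead.

```cpp
#include <parquet/api/reader.h>

#include <iostream>
#include <memory>
#include <string>

// Sketch: print the row-group layout of a Parquet file so you can judge
// how many concurrent row-group reads are possible per file.
void PrintRowGroupLayout(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  std::cout << path << ": " << metadata->num_row_groups() << " row group(s)\n";
  for (int i = 0; i < metadata->num_row_groups(); ++i) {
    auto rg = metadata->RowGroup(i);
    std::cout << "  row group " << i << ": " << rg->num_rows() << " rows, "
              << rg->total_byte_size() << " bytes (uncompressed)\n";
  }
}
```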
