Tom-Newton commented on issue #40035: URL: https://github.com/apache/arrow/issues/40035#issuecomment-2119279863
I've been thinking about a policy for setting these parameters automatically, as @felipecrv suggested. My current idea is to treat each call to `ReadAt` as one iteration of an optimisation algorithm, e.g. gradient descent minimising `ReadAt` duration. If we imagine varying `chunk_size` from very small to very large, the best performance will be somewhere in the middle, and I expect the optimum to depend on the typical latency and bandwidth between blob storage and the client.

I've also taken a look at how [`azcopy`](https://github.com/Azure/azure-storage-azcopy) implements this, since in my experience `azcopy` is very fast in a wide variety of situations without needing any configuration. I thought it was interesting that it deliberately tries to read out of order: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21

I think the downloading actually happens here: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332

There is mention of applying some auto-correct policy to the block size if the user did not set it. I'm definitely going to dig deeper into what `azcopy` does before I commit to a particular strategy.
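
To make the "one `ReadAt` call per optimisation iteration" idea concrete, here is a rough sketch in Python rather than Arrow's C++. Everything in it is an assumption for illustration: `ChunkSizeTuner` and `simulated_read_seconds` are hypothetical names, the cost model is a toy (per-request latency amortised over parallel connections plus a straggler term for the last chunk), and the update rule is plain hill climbing in log2 space standing in for gradient descent, since a single timing gives no gradient. It also normalises duration by bytes read so that reads of different sizes are comparable, which is my own addition.

```python
import math


class ChunkSizeTuner:
    """Hypothetical tuner: hill-climbs chunk_size, one observation per read."""

    def __init__(self, initial=4 * 1024 * 1024, min_size=64 * 1024,
                 max_size=256 * 1024 * 1024):
        self.min_log = math.log2(min_size)
        self.max_log = math.log2(max_size)
        self.log_size = math.log2(initial)
        self.step = 1.0        # step in log2 space: 1.0 doubles or halves
        self.prev_cost = None  # seconds per byte seen on the previous read

    @property
    def chunk_size(self):
        return int(round(2 ** self.log_size))

    def observe(self, duration_s, nbytes):
        """Record one read and pick the chunk size for the next one."""
        cost = duration_s / nbytes
        if self.prev_cost is not None and cost > self.prev_cost:
            # The last move made things worse: reverse and take smaller steps.
            self.step = -self.step / 2
        self.prev_cost = cost
        moved = self.log_size + self.step
        self.log_size = min(self.max_log, max(self.min_log, moved))


def simulated_read_seconds(nbytes, chunk_size, latency_s=0.05,
                           per_conn_bps=25e6, max_parallel=8):
    """Toy cost model (an assumption, not measured Azure behaviour)."""
    request_overhead = nbytes * latency_s / (chunk_size * max_parallel)
    transfer = nbytes / (max_parallel * per_conn_bps)
    straggler = chunk_size / per_conn_bps  # waiting on the final chunk
    return request_overhead + transfer + straggler


if __name__ == "__main__":
    read_bytes = 64 * 1024 * 1024
    tuner = ChunkSizeTuner(initial=256 * 1024)
    for _ in range(40):
        seconds = simulated_read_seconds(read_bytes, tuner.chunk_size)
        tuner.observe(seconds, read_bytes)
    print("chunk size after 40 simulated reads:", tuner.chunk_size)
```

In a real client the simulated timing would be replaced by a wall-clock measurement around the actual read, and measurement noise would probably call for smoothing over several reads before reversing direction.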
