Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-2119279863

   I've been thinking about a policy for setting these parameters
automatically (as @felipecrv suggested). So far, my idea is that each call to
`ReadAt` could be one iteration of an optimisation algorithm, e.g. gradient
descent, minimising `ReadAt` duration.
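
   A minimal sketch of the "one optimisation step per `ReadAt` call" idea, in Python for brevity. It uses simple hill climbing rather than true gradient descent, since a single timing per call gives no gradient directly. `ChunkSizeTuner`, `record`, and all default values are hypothetical names and numbers chosen for illustration; nothing here is an existing Arrow API.

```python
KiB = 1024
MiB = 1024 * KiB


class ChunkSizeTuner:
    """Hypothetical hill-climbing tuner: one adjustment per observed read."""

    def __init__(self, initial=4 * MiB, minimum=64 * KiB,
                 maximum=256 * MiB, factor=2.0):
        self.chunk_size = initial
        self._min = minimum
        self._max = maximum
        self._factor = factor      # multiplicative step size (assumed)
        self._direction = 1        # +1 grows the chunk size, -1 shrinks it
        self._last_cost = None     # seconds per byte of the previous read

    def record(self, bytes_read, seconds):
        """Feed in one read's size and duration, then pick the next chunk size."""
        # Normalise by size so reads of different lengths are comparable.
        cost = seconds / max(bytes_read, 1)
        # If the last move made things worse, turn around.
        if self._last_cost is not None and cost > self._last_cost:
            self._direction = -self._direction
        self._last_cost = cost
        step = self._factor ** self._direction
        proposed = int(self.chunk_size * step)
        self.chunk_size = min(self._max, max(self._min, proposed))


tuner = ChunkSizeTuner()
# In a real reader the loop would look roughly like this
# (read_at is a hypothetical stand-in for the actual call):
#   start = time.monotonic()
#   data = read_at(offset, nbytes, chunk_size=tuner.chunk_size)
#   tuner.record(len(data), time.monotonic() - start)
```

   A tuner like this never settles; it keeps oscillating around the best value it has found, which also lets it follow changing network conditions. Real timings are noisy, so a practical version would probably need to smooth the measurements before reacting to them.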
   
   If we imagine varying `chunk_size` from very small to very large, the best
performance will be somewhere in the middle, and I expect the optimum to
depend on the typical latency and bandwidth between blob storage and the
client.
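
   To illustrate why the optimum sits in the middle, here is a toy cost model in Python. It assumes chunks are fetched in "waves" of at most `max_concurrency` parallel requests, and that each wave costs one round-trip latency plus the time to stream one chunk over a single connection. The function names and the default latency, bandwidth, and concurrency values are assumptions made up for this sketch, not measurements.

```python
import math

KiB = 1024
MiB = 1024 * KiB


def estimated_read_seconds(total_bytes, chunk_size, *, latency_s=0.05,
                           per_connection_bytes_per_s=10e6,
                           max_concurrency=8):
    """Toy estimate of the time to read total_bytes in parallel chunks."""
    chunks = math.ceil(total_bytes / chunk_size)
    waves = math.ceil(chunks / max_concurrency)
    # Each wave pays one round trip plus the time to stream one full chunk
    # (the last chunk may be shorter, so this slightly overestimates).
    return waves * (latency_s + chunk_size / per_connection_bytes_per_s)


def best_chunk_size(total_bytes, candidates, **model):
    """Return the candidate chunk size with the lowest estimated time."""
    return min(candidates,
               key=lambda c: estimated_read_seconds(total_bytes, c, **model))


candidates = [64 * KiB, 256 * KiB, 1 * MiB, 4 * MiB, 8 * MiB,
              16 * MiB, 64 * MiB]
for c in candidates:
    seconds = estimated_read_seconds(64 * MiB, c)
    print(f"{c // KiB:>6} KiB -> {seconds:.2f} s")
```

   With these assumed defaults, 8 MiB is the fastest candidate for a 64 MiB read. Very small chunks pay the latency term many times over, while very large chunks leave connections idle and are limited by the per-connection bandwidth term. How steep the penalty is on each side is set by those two terms.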
   
   I've also taken a bit of a look at how 
[`azcopy`](https://github.com/Azure/azure-storage-azcopy) implements this since 
my experience is that `azcopy` is very fast in a wide variety of situations 
without needing to provide any configuration. 
   I thought it was interesting that it deliberately tries to read out of order 
https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21
   I think the downloading actually happens here 
https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332
   There is also mention of an auto-correct policy that is applied to the
block size if the user did not set it.
   
   I'm definitely going to dig a bit deeper into what `azcopy` does before I 
commit to a particular strategy. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

