Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

via GitHub Sun, 19 May 2024 08:42:21 -0700


Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-2119279863

I've been thinking a bit about a policy we can use to set these parameters
automatically (as @felipecrv suggested). So far I'm thinking each call to
`ReadAt` could be one iteration of an optimisation algorithm, e.g. gradient
descent minimising `ReadAt` duration.
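
To make the "one iteration per `ReadAt` call" idea concrete, here is a minimal sketch. It is not Arrow or Azure SDK code: `ChunkSizeTuner`, `observe` and `synthetic_cost` are hypothetical names, and since a single timing does not give a gradient, the sketch swaps in a simple hill-climbing rule in place of gradient descent. It assumes each read reports a cost such as seconds per byte; real timings are noisy and would probably need smoothing first.

```python
class ChunkSizeTuner:
    """Hypothetical tuner: hill-climbs chunk_size, one step per observed read."""

    def __init__(self, initial=4 * 1024 * 1024, factor=2.0,
                 min_size=64 * 1024, max_size=256 * 1024 * 1024):
        self.chunk_size = initial
        self.factor = factor
        self.min_size = min_size
        self.max_size = max_size
        self._direction = 1      # +1 grows the chunk size, -1 shrinks it
        self._last_cost = None   # cost reported by the previous read

    def observe(self, cost):
        """Record the cost of the read just made and pick the next chunk size."""
        # If the last change made things worse, head back the other way.
        if self._last_cost is not None and cost > self._last_cost:
            self._direction = -self._direction
        self._last_cost = cost

        proposed = int(self.chunk_size * self.factor ** self._direction)
        clamped = max(self.min_size, min(self.max_size, proposed))
        if clamped != proposed:
            # Hit a bound: turn around so the tuner cannot get stuck there.
            self._direction = -self._direction
        self.chunk_size = clamped
        return self.chunk_size


def synthetic_cost(chunk_size):
    # Made-up convex cost with its minimum near 1e6 bytes, used only to
    # exercise the tuner; it is not a measurement of any real storage.
    return 1e6 / chunk_size + chunk_size / 1e6


tuner = ChunkSizeTuner()
for _ in range(50):
    # In real use, the cost would come from timing an actual ReadAt call.
    tuner.observe(synthetic_cost(tuner.chunk_size))
print(tuner.chunk_size)
```

With a fixed step factor the tuner never settles on one value; it keeps oscillating around the best chunk size, which also lets it track changing network conditions.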

If we imagine varying `chunk_size` from very small to very large, the best
performance is going to be somewhere in the middle, and I expect the optimum
will depend on the typical latency and bandwidth between blob storage and the
client.
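
A toy cost model illustrates why the optimum sits in the middle. This is a sketch under simplifying assumptions, not a measurement: it assumes a fixed number of parallel connections, a fixed latency per request and a fixed bandwidth per connection, and it treats every chunk as full-sized. The function name and all the numbers are made up for illustration.

```python
import math


def estimated_read_seconds(total_bytes, chunk_size, concurrency,
                           latency_s, per_connection_bytes_per_s):
    """Rough time to read total_bytes in chunk_size pieces, `concurrency` at a time."""
    chunk_size = min(chunk_size, total_bytes)
    n_chunks = math.ceil(total_bytes / chunk_size)
    # Chunks are fetched in rounds of `concurrency` parallel requests.
    rounds = math.ceil(n_chunks / concurrency)
    # Each round pays one request latency plus the transfer time of one chunk.
    return rounds * (latency_s + chunk_size / per_connection_bytes_per_s)


MiB = 1024 * 1024
for chunk in (64 * 1024, 8 * MiB, 64 * MiB):
    seconds = estimated_read_seconds(64 * MiB, chunk, concurrency=8,
                                     latency_s=0.05,
                                     per_connection_bytes_per_s=10 * MiB)
    print(f"{chunk:>10} bytes per chunk -> {seconds:.2f} s")
```

In this model, very small chunks pay the request latency many times over, while a single huge chunk gets no benefit from parallelism, so the middle value wins. Changing `latency_s` or the bandwidth moves the best chunk size, which is the dependence described above.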

I've also taken a look at how
[`azcopy`](https://github.com/Azure/azure-storage-azcopy) implements this, since
in my experience `azcopy` is very fast in a wide variety of situations
without needing any configuration.
I thought it was interesting that it deliberately tries to read out of order:
https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21
I think the downloading actually happens here:
https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332
There is mention of applying some auto-correct policy to the block size if
the user did not set it.

I'm definitely going to dig a bit deeper into what `azcopy` does before I
commit to a particular strategy.
