eeroel commented on issue #38664: URL: https://github.com/apache/arrow/issues/38664#issuecomment-1807048863
Did some benchmarking against AWS CLI (`aws s3 cp`):

- AWS CLI also connected to only one IP address, so that's probably OK and not a bottleneck at these rates. I also tested setting https://curl.se/libcurl/c/CURLOPT_DNS_SHUFFLE_ADDRESSES.html but didn't see any performance improvement, although connections were made with different IPs.
- AWS CLI is 15-20% faster on my computer, so there could be some room for optimization in the pre-buffer / cache parameters, but it's not a major difference. Interestingly, the CLI downloads the file mostly in 9MB or 18MB chunks, with the 9MB chunks at half the rate compared to the 18MB ones.
- I mentioned above that Colab took 15s to download the file, but this must have been an instance outside of the US; on another instance I get 2-4s download times (both with AWS CLI and pyarrow 13/14).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
