steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420

> fs.s3a.connection.maximum

It's 30 AFAIK. I should revisit that, you know; there was never a reason for it other than that some uses can overload things (e.g. Hive creating many per-user instances).

* Impala runs with thousands; you need to bump up the thread pool too.
* If you have Spark workers and they all work with the same few buckets: go big.
* If you have Spark workers working with different buckets, balance the capacity.

Generally, for metadata ops (HEAD, LIST) and for the copy ops used in rename, those connections don't overload the client; they are just waiting for things to happen. It's the GETs and PUTs which use up the bandwidth.

Why don't you submit a Hadoop PR which bumps the default to some higher number we can all agree on, and I'll review it. We could certainly do something in the 64-100 range; above that gets harder to defend. (I'm ignoring actual throttling of S3 REST calls; those span your entire cluster.)
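As a minimal sketch of the tuning being discussed: the snippet below raises both `fs.s3a.connection.maximum` (the HTTP connection pool) and `fs.s3a.threads.max` (the S3A worker thread pool mentioned above) for a Spark job via the `spark.hadoop.*` prefix. The specific values (96 connections, 64 threads) are illustrative assumptions, not agreed defaults.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: bump the S3A connection pool and its thread pool together,
// as suggested in the comment. Values are illustrative only.
val spark = SparkSession.builder()
  .appName("s3a-connection-tuning-example")
  // fs.s3a.connection.maximum: size of the S3A HTTP connection pool
  .config("spark.hadoop.fs.s3a.connection.maximum", "96")
  // fs.s3a.threads.max: worker threads for parallel S3A operations;
  // raise alongside the connection pool so threads aren't starved of connections
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .getOrCreate()
```

The same properties can equally be set cluster-wide in `core-site.xml` (without the `spark.hadoop.` prefix); the point is that the two settings should be tuned together.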
