steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420
 
 
   > fs.s3a.connection.maximum 
   
   It's 30 AFAIK. I should revisit that, you know; there was never a reason for it 
other than that some uses can overload things (e.g. Hive running many user instances).
   
   * Impala runs with thousands; you need to bump up the thread pool too. 
   * If you have Spark workers and they all work with the same few buckets: go 
big.
   * If you have Spark workers working with different buckets, balance the 
capacity.
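
   As a rough illustration of the sizing advice in the list above, here is a minimal sketch that builds the Spark `--conf` entries for the S3A connection pool and thread pool. It assumes that "the thread pool" maps to `fs.s3a.threads.max` and that "balance the capacity" can be done with S3A per-bucket overrides (`fs.s3a.bucket.<name>.<option>`). The helper name, the bucket name and all the numbers are hypothetical, not recommendations.

```python
def s3a_pool_settings(max_connections, max_threads, per_bucket=None):
    """Build Spark conf entries for the S3A connection and thread pools.

    per_bucket maps a bucket name to a connection limit for that bucket only.
    """
    conf = {
        # Hadoop options are passed through Spark with the "spark.hadoop." prefix.
        "spark.hadoop.fs.s3a.connection.maximum": str(max_connections),
        "spark.hadoop.fs.s3a.threads.max": str(max_threads),
    }
    for bucket, limit in (per_bucket or {}).items():
        # Per-bucket override: applies only to filesystem instances for this bucket.
        key = f"spark.hadoop.fs.s3a.bucket.{bucket}.connection.maximum"
        conf[key] = str(limit)
    return conf


# Illustrative values only; "example-logs" is a made-up bucket name.
settings = s3a_pool_settings(96, 64, per_bucket={"example-logs": 32})
for key, value in sorted(settings.items()):
    print(f"--conf {key}={value}")
```

   The printed lines can be pasted into a `spark-submit` command; whether any given value is sensible depends on the workload and on how many buckets each worker talks to.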
   
   Generally, for metadata ops (HEAD, LIST) and for the COPY ops used in rename, 
those connections don't overload the client; they are just waiting for things to 
happen. It's the GET and PUT requests which use up bandwidth. 
   
   Why don't you submit a Hadoop PR which bumps the default value to some 
higher number which you can all agree on, and I'll review. We could certainly 
go to somewhere in the 64-100 range; above that it gets harder to defend.
   
   (I'm ignoring actual throttling of S3 REST calls; that throttling spans your 
entire cluster.)

