steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-549465360

Nice experiment! I guess that in EC2 you're limited by the number of cores, but latency is also nice and low. Remotely, latency is worse, so anything we can do in parallel threads brings tangible benefits.

In both local and remote S3 interaction, rename() is faked with a COPY, which runs at 6-10 MB/s; that can be done via the thread pool too if you can configure the AWS SDK to split up a large copy into parallel parts. That shares the same pools, so it's useful to have some capacity there on any process renaming things.
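
To illustrate the latency point, here is a minimal Python sketch, not code from the PR. It assumes each blocking FileSystem call costs one fixed round trip, which `fake_list_status` simulates with a sleep; that function, the `s3a://example-bucket/...` paths, and the 50 ms latency are all hypothetical stand-ins rather than measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_list_status(path, latency_s=0.05):
    """Hypothetical stand-in for a blocking FileSystem call.

    Sleeps for latency_s to mimic one network round trip, then
    returns a made-up child path.
    """
    time.sleep(latency_s)
    return f"{path}/part-00000"


def resolve_serial(paths, latency_s=0.05):
    """Resolve every path one after another on the calling thread."""
    return [fake_list_status(p, latency_s) for p in paths]


def resolve_parallel(paths, latency_s=0.05, threads=8):
    """Resolve paths on a thread pool; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda p: fake_list_status(p, latency_s), paths))


def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical bucket and partition layout, for illustration only.
    paths = [f"s3a://example-bucket/table/date={d}" for d in range(16)]
    _, serial_s = timed(resolve_serial, paths)
    _, parallel_s = timed(resolve_parallel, paths)
    print(f"serial:   {serial_s:.2f}s")
    print(f"parallel: {parallel_s:.2f}s")
```

Because the simulated work is pure waiting, the serial time grows with the number of paths times the latency, while the pooled version grows with roughly the number of paths divided by the thread count. That is why the benefit should be larger remotely than in EC2, where each round trip is already short.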
