cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-534380034

I ran additional measurements with different thread counts on the S3 Landsat data, and the sweet spot seems to be somewhere between 20 and 30 threads (for my environment, anyway).

**30 glob paths** - 30 glob paths with a final result of 1206 files
**single glob path** - 1 single glob path with a final result of 1206 files
**raw paths** - 1206 raw paths without any globs

See here: https://github.com/apache/spark/pull/25899#issuecomment-534069194

**original code**
30 glob paths _15.6 seconds_
single glob path _11.3 seconds_
raw paths _59 seconds_

**8 threads**
30 glob paths _1.48 seconds_
single glob path _11 seconds_
raw paths _7.73 seconds_

**20 threads**
30 glob paths _1.47 seconds_
single glob path _15.45 seconds_
raw paths _4.16 seconds_

**30 threads**
30 glob paths _0.92 seconds_
single glob path _11.74 seconds_
raw paths _4.12 seconds_

**40 threads**
30 glob paths _0.93 seconds_
single glob path _13.48 seconds_
raw paths _4.08 seconds_
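
To make the comparison easier to read, the timings above can be turned into speedup factors relative to the original code. The following is only a minimal sketch of that arithmetic: the dictionary layout, the `speedups` helper, and the short scenario labels are illustrative assumptions, and it assumes every thread count was measured against the same three scenarios.

```python
# Timings in seconds, copied from the measurements in this comment.
# Outer keys are thread counts; None marks the original (unparallelized) code.
# The layout and names here are illustrative, not part of the PR.
timings = {
    None: {"30 glob paths": 15.6, "single glob path": 11.3, "raw paths": 59.0},
    8: {"30 glob paths": 1.48, "single glob path": 11.0, "raw paths": 7.73},
    20: {"30 glob paths": 1.47, "single glob path": 15.45, "raw paths": 4.16},
    30: {"30 glob paths": 0.92, "single glob path": 11.74, "raw paths": 4.12},
    40: {"30 glob paths": 0.93, "single glob path": 13.48, "raw paths": 4.08},
}


def speedups(timings):
    """Return baseline_time / parallel_time per thread count and scenario."""
    baseline = timings[None]
    return {
        threads: {scenario: baseline[scenario] / t for scenario, t in row.items()}
        for threads, row in timings.items()
        if threads is not None
    }


result = speedups(timings)
for threads in sorted(result):
    rounded = {scenario: round(x, 1) for scenario, x in result[threads].items()}
    print(f"{threads} threads: {rounded}")
```

Read this way, the 30 glob paths and raw paths cases improve by roughly an order of magnitude, with little change beyond 20 to 30 threads, while the single glob path case stays close to 1x (and is sometimes slower), which is expected since a single glob gives the thread pool only one path to work on.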