cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-534380034
 
 
   I ran additional measurements with different thread counts on the S3 
Landsat data, and the sweet spot seems to be somewhere between 20 and 30 
threads (in my environment, anyway).
   
   **30 glob paths** - 30 glob paths with the final result of 1206 files
   **single glob path** - 1 single glob path with the final result of 1206 files
   **raw paths** - 1206 raw paths without any globs
   
   see here: https://github.com/apache/spark/pull/25899#issuecomment-534069194
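
As a rough illustration of what is being varied here, the sketch below issues one blocking call per path through a thread pool of configurable size and times the whole batch. It is not the code from this PR: `fake_status_call`, the simulated latency, and the `s3a://example-bucket/...` paths are hypothetical stand-ins for real FileSystem calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_status_call(path, latency=0.01):
    """Hypothetical stand-in for one blocking FileSystem call on a path."""
    time.sleep(latency)  # simulate network round-trip time
    return path


def resolve_paths(paths, num_threads):
    """Run one blocking call per path, at most num_threads at a time.

    Results come back in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fake_status_call, paths))


def timed(paths, num_threads):
    """Return (results, elapsed wall-clock seconds) for one run."""
    start = time.perf_counter()
    results = resolve_paths(paths, num_threads)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Made-up paths, only used to drive the simulated calls.
    paths = [f"s3a://example-bucket/file-{i}" for i in range(200)]
    for n in (1, 8, 20, 30, 40):
        _, elapsed = timed(paths, n)
        print(f"{n:>2} threads: {elapsed:.2f} s")
```

With a fixed simulated latency the timings only show the general shape of the effect; real S3 latencies vary, so they say nothing about where the sweet spot is in a given environment.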
   
   **original code**
   30 glob paths _15.6 seconds_
   single glob path _11.3 seconds_
   raw paths _59 seconds_
   
   **8 threads**
   30 glob paths _1.48 seconds_
   single glob path _11 seconds_
   raw paths _7.73 seconds_
   
   **20 threads**
   30 glob paths _1.47 seconds_
   single glob path _15.45 seconds_
   raw paths _4.16 seconds_
   
   **30 threads**
   30 glob paths _0.92 seconds_
   single glob path _11.74 seconds_
   raw paths _4.12 seconds_
   
   **40 threads**
   30 glob paths _0.93 seconds_
   single glob path _13.48 seconds_
   raw paths _4.08 seconds_
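
To make the numbers easier to compare, here is a small sketch that turns them into speedups relative to the original code. The timings are copied from the measurements above; the dictionary layout and the `speedups` helper are only for illustration, and the first row of the 30-thread run is assumed to be the same 30-glob-path case as in the other runs.

```python
# Wall-clock seconds copied from the measurements above.
baseline = {"30 glob paths": 15.6, "single glob path": 11.3, "raw paths": 59.0}

runs = {
    8: {"30 glob paths": 1.48, "single glob path": 11.0, "raw paths": 7.73},
    20: {"30 glob paths": 1.47, "single glob path": 15.45, "raw paths": 4.16},
    30: {"30 glob paths": 0.92, "single glob path": 11.74, "raw paths": 4.12},
    40: {"30 glob paths": 0.93, "single glob path": 13.48, "raw paths": 4.08},
}


def speedups(baseline, runs):
    """Return baseline time divided by measured time, per thread count and case."""
    return {
        threads: {case: baseline[case] / secs for case, secs in times.items()}
        for threads, times in runs.items()
    }


if __name__ == "__main__":
    for threads, cases in speedups(baseline, runs).items():
        row = ", ".join(f"{case}: {factor:.1f}x" for case, factor in cases.items())
        print(f"{threads} threads -> {row}")
```

On these numbers the raw-paths case goes from roughly 7.6x at 8 threads to about 14x at 20 threads and above, the 30-glob-path case reaches about 17x at 30 and 40 threads, and the single-glob-path case stays at or below roughly 1x for every thread count.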
   
