Hello,

On Spark 2.4.4, I am using DataFrameReader#csv to read about 300,000 files on S3, and I've noticed that the driver spends about an hour before it even starts listing the files. You can see the gap in the timestamps, from 07:45 to 08:54, before the InMemoryFileIndex log line appears:

19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: [300K files...]
I believe the issue comes from DataSource#checkAndGlobPathIfNecessary [0], specifically from its call to FileSystem#exists. Unlike bulkListLeafFiles, the existence checks run in a single-threaded flatMap, and each check is a blocking network call when the files are stored on S3, which adds up quickly with this many paths. I believe there is a fairly straightforward opportunity for improvement here: parallelize the existence checks, perhaps with a configurable number of threads (a rough sketch is at the end of this mail). If that seems reasonable, I would like to create a JIRA ticket and submit a patch.

Please let me know!

Cheers,
Arwin

[0] https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557
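P.S. To make the idea concrete, here is a rough, untested sketch of the kind of change I have in mind. The helper name, the thread-count parameter, and the use of a plain FileNotFoundException are placeholders for illustration only; the real patch would live inside DataSource, keep its existing AnalysisException error message, and take its parallelism from whatever config key we agree on.

import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: run the per-path existence checks on a small thread pool instead of
// a sequential flatMap, so the S3 round trips overlap rather than serialize.
def checkPathsExist(paths: Seq[Path], hadoopConf: Configuration, numThreads: Int): Unit = {
  val executor = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(executor)
  try {
    val checks = paths.map { path =>
      Future {
        val fs = path.getFileSystem(hadoopConf)
        if (!fs.exists(path)) {
          // The real code raises AnalysisException("Path does not exist: ...");
          // a plain exception keeps this sketch self-contained.
          throw new java.io.FileNotFoundException(s"Path does not exist: $path")
        }
      }
    }
    // Block until all checks finish; any missing path surfaces as a failed Future.
    Await.result(Future.sequence(checks), Duration.Inf)
  } finally {
    executor.shutdown()
  }
}

The intent is simply to overlap the S3 round trips; the thread count could be driven by a new configuration alongside the existing parallel partition discovery settings.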