I think the problem is the call to globStatus to expand all 300K files.
This is a general problem for object stores and huge numbers of files.
Steve L. may have better thoughts on real solutions. But you might
consider, if possible, running many .csv read jobs in parallel, each
querying a subset of the files, and unioning the results. At least
that way you parallelize the listing and reading against the object
store.
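Something like this rough sketch, assuming you can carve the bucket
into disjoint prefixes (the prefix layout and thread counts below are
made up; substitute whatever actually partitions your files):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.getOrCreate()

    // Hypothetical: 16 prefixes that together cover all of the files.
    val prefixes = (0 until 16).map(i => f"s3://foo/prefix=$i%02d/*.csv")

    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(pool)

    // Each read job lists and reads only its own subset, so the
    // per-prefix listing runs concurrently instead of serially.
    // Passing an explicit schema would also avoid re-inferring it
    // once per subset.
    val parts = prefixes.map(p => Future(spark.read.csv(p)))
    val df: DataFrame = Await.result(Future.sequence(parts), Duration.Inf)
      .reduce(_ union _)
    pool.shutdown()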

I think it's hard to optimize this case from the Spark side, as it's
not clear up front how big a glob like s3://foo/* is going to be. I
think it would take reimplementing some of the listing logic to expand
the glob incrementally. Or maybe I am overlooking optimizations that
have gone into Spark 3.
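If it helps make "incrementally" concrete: I'd imagine driving the
listing through Hadoop's RemoteIterator API rather than materializing
every match up front. A totally untested sketch, not real Spark code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{LocatedFileStatus, Path}

    // Rough idea only: listLocatedStatus returns a RemoteIterator that
    // pages through results, so a caller could consume matches lazily
    // instead of holding all 300K statuses in memory at once.
    def listIncrementally(dir: Path,
                          conf: Configuration): Iterator[LocatedFileStatus] = {
      val it = dir.getFileSystem(conf).listLocatedStatus(dir)
      new Iterator[LocatedFileStatus] {
        def hasNext: Boolean = it.hasNext
        def next(): LocatedFileStatus = it.next()
      }
    }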

On Fri, Sep 6, 2019 at 7:09 AM Arwin Tio <arwin....@hotmail.com> wrote:
>
> Hello,
>
> On Spark 2.4.4, I am using DataFrameReader#csv to read about 300000 files on
> S3, and I've noticed that it takes about an hour of driver-side work before
> the files are even listed. You can see the gap in the timestamps between the
> StateStoreCoordinator log at 7:45 and the InMemoryFileIndex log at 8:54:
>
> 19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
> 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
> ...
> 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
> 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
>
>
> I believe that the issue comes from DataSource#checkAndGlobPathIfNecessary
> [0], specifically where it calls FileSystem#exists. Unlike bulkListLeafFiles,
> the existence checks here happen in a single-threaded flatMap, and each check
> is a blocking network call when the files are stored on S3.
>
> I believe there is a fairly straightforward opportunity for improvement
> here: parallelize the existence checks, perhaps with a configurable number
> of threads. If that seems reasonable, I would like to create a JIRA ticket
> and submit a patch. Please let me know!
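>
> To illustrate the shape of the change (the helper name and signature
> below are made up for this email, not the actual patch):
>
>     import java.util.concurrent.Executors
>     import scala.concurrent.{Await, ExecutionContext, Future}
>     import scala.concurrent.duration.Duration
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.Path
>
>     // Hypothetical helper: run the per-path existence checks on a
>     // bounded thread pool instead of a sequential flatMap.
>     def checkPathsExist(paths: Seq[Path],
>                         conf: Configuration,
>                         numThreads: Int): Unit = {
>       val pool = Executors.newFixedThreadPool(numThreads)
>       implicit val ec: ExecutionContext =
>         ExecutionContext.fromExecutorService(pool)
>       try {
>         val checks = paths.map { path =>
>           Future {
>             // FileSystem instances are cached per scheme/authority,
>             // so this lookup is cheap; exists() is the network call.
>             val fs = path.getFileSystem(conf)
>             if (!fs.exists(path)) {
>               throw new java.io.FileNotFoundException(
>                 s"Path does not exist: $path")
>             }
>           }
>         }
>         Await.result(Future.sequence(checks), Duration.Inf)
>       } finally {
>         pool.shutdown()
>       }
>     }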
>
> Cheers,
>
> Arwin
>
> [0] 
> https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557
