Arwin S Tio created SPARK-29089:
-----------------------------------
Summary: DataFrameReader bottleneck in
DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files
Key: SPARK-29089
URL: https://issues.apache.org/jira/browse/SPARK-29089
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.4
Reporter: Arwin S Tio
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've
noticed that it takes about an hour before the files are loaded on the driver.
You can see the gap in the log timestamps: nothing is logged between 07:45 and
08:54, when InMemoryFileIndex finally starts listing:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application:
LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered
StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in
parallel under: [300K files...]{quote}
A major source of the bottleneck is
DataSource#checkAndGlobPathIfNecessary, which will [(possibly)
glob|https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L549]
and call
[FileSystem#exists|https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557]
on all the paths in a single thread. On S3, these are slow network calls.
After a [discussion on the mailing
list|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html],
it was suggested that an improvement could be to:
* have SparkHadoopUtil differentiate between paths returned by globStatus(),
which are therefore known to exist, and paths that were not globbed for; only
the latter need an existence check
* run the glob and existence checks in parallel
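To illustrate the second point, here is a minimal, hypothetical sketch of fanning the per-path existence checks out over a fixed thread pool instead of looping in a single thread. The names `checkPathsExist` and `pathExists` are illustrative, not Spark APIs; `pathExists` stands in for the slow `FileSystem#exists` network call against S3, and paths already returned by `globStatus()` would simply be skipped before this step.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelPathCheck {
  // Returns the subset of `paths` that do not exist, checking them
  // concurrently on a bounded thread pool rather than one at a time.
  def checkPathsExist(paths: Seq[String], numThreads: Int)
                     (pathExists: String => Boolean): Seq[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // One Future per path; each wraps a (potentially slow) network call.
      val checks: Seq[Future[Option[String]]] =
        paths.map(p => Future { if (pathExists(p)) None else Some(p) })
      // Collect the paths whose existence check failed.
      Await.result(Future.sequence(checks), 10.minutes).flatten
    } finally {
      pool.shutdown()
    }
  }
}
```

With 300k paths and round-trip latency dominating, a pool of even a few dozen threads would cut the wall-clock time of this phase roughly in proportion to the pool size, at the cost of more concurrent S3 requests.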
--
This message was sent by Atlassian Jira
(v8.3.2#803003)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]