[
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-29089.
----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
Issue resolved by pull request 25899
[https://github.com/apache/spark/pull/25899]
> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when
> reading large amount of S3 files
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Arwin S Tio
> Assignee: Arwin S Tio
> Priority: Minor
> Fix For: 3.1.0
>
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've
> noticed that it took about an hour for the files to be loaded on the driver.
>
> The log timestamps show the gap: the InMemoryFileIndex message is logged at
> 08:54, more than an hour after the preceding message at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
> 19/09/06 07:44:42 INFO SparkContext: Submitted application:
> LoglineParquetGenerator
> ...
> 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered
> StateStoreCoordinator endpoint
> 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories
> in parallel under: [300K files...]
> {quote}
>
> A major source of the bottleneck is
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an
> improvement could be to:
>
> * have SparkHadoopUtil differentiate between files returned by
> globStatus(), which therefore exist, and those it didn't glob for; only
> the latter need an existence check.
> * add parallel execution to the glob and existence checks
>
> I am currently working on a patch that implements this improvement.
> [0]
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
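
The two suggestions in the description (skip the existence check for paths that came back from a glob, and run the remaining glob and existence calls in parallel) can be illustrated with a minimal Python sketch. This is not the code from the pull request and does not use the Hadoop or Spark APIs: `FakeStore`, `check_and_glob`, and `num_threads` are hypothetical names, and the in-memory store only stands in for slow remote calls so that the number of existence checks can be counted.

```python
import fnmatch
import threading
from concurrent.futures import ThreadPoolExecutor

# Characters that mark a path as a glob pattern in this sketch (assumption:
# a simplified subset; real Hadoop globs also support braces).
GLOB_CHARS = set("*?[")


class FakeStore:
    """Hypothetical stand-in for a remote file system such as S3."""

    def __init__(self, paths):
        self._paths = set(paths)
        self._lock = threading.Lock()
        self.exists_calls = 0  # counts simulated network round trips

    def glob(self, pattern):
        # Every path returned here is known to exist already.
        # Note: fnmatch lets '*' match '/', unlike a real path glob.
        return sorted(p for p in self._paths if fnmatch.fnmatchcase(p, pattern))

    def exists(self, path):
        with self._lock:
            self.exists_calls += 1
        return path in self._paths


def check_and_glob(store, paths, num_threads=8):
    """Resolve each input path in a thread pool, preserving input order."""

    def resolve(path):
        if any(c in GLOB_CHARS for c in path):
            matches = store.glob(path)
            if not matches:
                raise FileNotFoundError(f"Path does not exist: {path}")
            return matches  # globbed results need no existence check
        if not store.exists(path):  # only non-glob paths are checked
            raise FileNotFoundError(f"Path does not exist: {path}")
        return [path]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        groups = list(pool.map(resolve, paths))  # re-raises worker errors
    return [p for group in groups for p in group]
```

Calling `check_and_glob(store, ["s3://b/a/*.csv", "s3://b/c.csv"])` on a store that holds two files under `s3://b/a/` plus `s3://b/c.csv` issues a single existence check, for the one path that was not produced by a glob. A thread pool suits this workload because the calls are I/O-bound; the pool size is a tuning choice, not a value taken from the issue.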