[jira] [Resolved] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

Sean R. Owen (Jira) Mon, 17 Feb 2020 07:32:08 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean R. Owen resolved SPARK-29089.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 25899
[https://github.com/apache/spark/pull/25899]

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29089
>                 URL: https://issues.apache.org/jira/browse/SPARK-29089
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Arwin S Tio
>            Assignee: Arwin S Tio
>            Priority: Minor
>             Fix For: 3.1.0
>
>
> When using DataFrameReader#csv to read many S3 files (300k in my case), I 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
> The gap is visible in the log timestamps: nothing is logged between 07:45:40 
> and 08:54:57, when the InMemoryFileIndex message appears:
> {quote}
> 19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
> 19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
> ...
> 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
> 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck is 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] check on all the paths in a single thread. 
> On S3, these are slow network calls.
> A discussion on the mailing list [0] suggested two improvements:
>   
>  * have SparkHadoopUtil differentiate between paths returned by 
> globStatus(), which therefore exist, and paths that it did not glob for; 
> only the latter need an existence check.
>  * add parallel execution to the glob and existence checks.
>   
> I am currently working on a patch that implements this improvement.
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
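
The two suggestions in the description above can be sketched roughly as follows. This is an illustration only, not the code from pull request 25899: it is written in Python against the local file system, with glob.glob and os.path.exists standing in for Hadoop's globStatus() and FileSystem#exists, and the names is_glob, resolve_one, check_and_glob_paths and num_threads are hypothetical. It assumes that a path returned by a glob is known to exist, so only non-glob paths are checked, and that the per-path work runs on a thread pool.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor


def is_glob(path):
    # Simplified test for glob metacharacters (assumption: Hadoop's own
    # glob syntax is richer than what Python's glob module supports).
    return any(ch in path for ch in "*?[")


def resolve_one(path, glob_fn=glob.glob, exists_fn=os.path.exists):
    """Resolve one input path to a list of concrete paths."""
    if is_glob(path):
        # Paths returned by the glob exist by construction,
        # so no separate existence check is made for them.
        matches = sorted(glob_fn(path))
        if not matches:
            raise FileNotFoundError(f"Path does not exist: {path}")
        return matches
    # Only paths that were not globbed need an existence check.
    if not exists_fn(path):
        raise FileNotFoundError(f"Path does not exist: {path}")
    return [path]


def check_and_glob_paths(paths, num_threads=8,
                         glob_fn=glob.glob, exists_fn=os.path.exists):
    """Glob and check all paths on a thread pool instead of one thread."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() keeps the input order and re-raises the first worker
        # exception when its result is consumed.
        results = pool.map(lambda p: resolve_one(p, glob_fn, exists_fn), paths)
        return [match for matches in results for match in matches]
```

Threads are a reasonable fit in this sketch because each call spends most of its time waiting on the network, which is the case the description reports for S3. A call such as check_and_glob_paths(["data/*.csv", "data/extra.csv"], num_threads=16) would return the matched files followed by the literal path, or raise FileNotFoundError if the glob matches nothing or the literal path is missing.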
