[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

Steve Loughran (JIRA) Tue, 03 Jan 2017 05:32:18 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795069#comment-15795069
 ]


Steve Loughran commented on SPARK-18917:
----------------------------------------

Looking at the code that would be optionally disabled, the underlying problem 
is that it does a recursive treewalk of the filesystem paths. That is a 
performance killer on object stores, where each directory costs 3-4 HTTPS 
requests.

HADOOP-13208 makes the recursive {{FileSystem.listFiles(path, recursive=true)}} 
call an {{O(leaf-files/5000)}} operation, irrespective of directory structure. 
This will be in Hadoop 2.8 and anything derived from that code.
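
To make the difference concrete, here is a rough back-of-the-envelope model in Scala. It is only a sketch: the object and method names are made up for illustration, the default of 4 requests per directory is the upper end of the "3-4" figure above, and reading the 5000 in {{O(leaf-files/5000)}} as the number of entries returned per listing request is an assumption.

```scala
// Rough cost model only; all names here are hypothetical, not Spark or Hadoop APIs.
object ListingCost {
  // Treewalk: every directory visited costs a fixed number of HTTPS requests.
  def treewalkRequests(directories: Long, requestsPerDir: Int = 4): Long =
    directories * requestsPerDir

  // Flat listing: one request per page of entries, whatever the directory layout.
  // Even an empty listing needs one request, hence the lower bound of 1.
  def flatListingRequests(leafFiles: Long, pageSize: Int = 5000): Long =
    math.max(1L, (leafFiles + pageSize - 1) / pageSize)
}
```

Under those assumptions, a table with 10,000 partition directories holding one file each costs about 40,000 requests with the treewalk and 2 with the flat listing.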

Moving {{PartitioningAwareFileIndex.bulkListLeafFiles()}} to that method would 
deliver the speedup without having to add and test a new option.
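
A minimal sketch of what a listing built on that call could look like, using only the public Hadoop {{FileSystem}} API. This is not the Spark implementation: {{FlatListing}} and {{listLeafFiles}} are hypothetical names, and any filtering or parallelism Spark applies when listing is left out.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of Spark or Hadoop.
object FlatListing {
  /** Collects every file under `root` through a single recursive listFiles call. */
  def listLeafFiles(root: Path, conf: Configuration): Seq[LocatedFileStatus] = {
    val fs: FileSystem = root.getFileSystem(conf)
    // listFiles returns files only (no directories); with recursive = true the
    // filesystem implementation decides how to enumerate the whole subtree.
    val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(root, true)
    val files = ArrayBuffer.empty[LocatedFileStatus]
    while (it.hasNext) {
      files += it.next()
    }
    files.toSeq
  }
}
```

The caller walks no directories itself, so whether the listing is cheap depends entirely on the filesystem client behind {{listFiles}}.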

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using Dataframe write in append mode on object stores (S3 / Google 
> Storage), writes take a long time or fail with read timeouts. 
> This is because dataframe.write lists all leaf folders in the target 
> directory. If there are a lot of subfolders due to partitions, this takes 
> forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip Partition Match Check in append mode. I can 
> work on the patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

