[
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761926#comment-15761926
]
Apache Spark commented on SPARK-18917:
--------------------------------------
User 'alunarbeach' has created a pull request for this issue:
https://github.com/apache/spark/pull/16339
> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> ------------------------------------------------------------------------------
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
> Issue Type: Improvement
> Components: EC2, SQL, YARN
> Affects Versions: 2.0.2
> Reporter: Anbu Cheeralan
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> When a DataFrame is written in append mode to an object store (S3 / Google
> Storage), the write takes a long time or fails with read timeouts.
> This is because dataframe.write lists all leaf folders in the target
> directory. If partitioning has produced many subfolders, this listing takes
> a very long time.
> The cause is in org.apache.spark.sql.execution.datasources.DataSource.write().
> The following code causes a huge number of RPC calls when the file system is an
> object store (S3, GS):
>   if (mode == SaveMode.Append) {
>     val existingPartitionColumns = Try {
>       resolveRelation()
>         .asInstanceOf[HadoopFsRelation]
>         .location
>         .partitionSpec()
>         .partitionColumns
>         .fieldNames
>         .toSeq
>     }.getOrElse(Seq.empty[String])
> There should be a flag to skip Partition Match Check in append mode. I can
> work on the patch.
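
To make the cost and the proposed flag concrete, here is a minimal, self-contained sketch. It does not use Spark or any real file system API: `Dir`, `CountingLister`, `validateAppend`, and the `skipPartitionCheck` parameter are all hypothetical names invented for illustration, and the model simply assumes one list request per folder visited. It is not the change made in the pull request.

```scala
object AppendCheckSketch {
  // Hypothetical stand-in for a folder in an object store.
  final case class Dir(name: String, children: Seq[Dir] = Seq.empty)

  // Hypothetical lister that counts one list request per folder visited.
  final class CountingLister {
    var listCalls: Int = 0

    // Returns the path segments of every leaf folder below `dir`.
    def leafPaths(dir: Dir, prefix: Seq[String] = Seq.empty): Seq[Seq[String]] = {
      listCalls += 1
      if (dir.children.isEmpty) Seq(prefix)
      else dir.children.flatMap(c => leafPaths(c, prefix :+ c.name))
    }
  }

  // Infers partition column names from the first leaf path,
  // e.g. Seq("year=1", "month=1") becomes Seq("year", "month").
  def existingPartitionColumns(root: Dir, lister: CountingLister): Seq[String] =
    lister.leafPaths(root).headOption.getOrElse(Seq.empty).map(_.takeWhile(_ != '='))

  // Hypothetical flag: when skipPartitionCheck is true, no listing happens at all.
  def validateAppend(
      root: Dir,
      writeColumns: Seq[String],
      skipPartitionCheck: Boolean,
      lister: CountingLister): Unit = {
    if (!skipPartitionCheck) {
      val existing = existingPartitionColumns(root, lister)
      require(
        existing.isEmpty || existing == writeColumns,
        s"Partition columns $writeColumns do not match existing $existing")
    }
  }

  // Builds a two-level partitioned layout: year=<y>/month=<m>.
  def sampleTree(years: Int, months: Int): Dir =
    Dir("table", (1 to years).map { y =>
      Dir(s"year=$y", (1 to months).map(m => Dir(s"month=$m")))
    })

  def main(args: Array[String]): Unit = {
    val tree = sampleTree(years = 3, months = 12)

    val checked = new CountingLister
    validateAppend(tree, Seq("year", "month"), skipPartitionCheck = false, checked)
    println(s"list calls with check: ${checked.listCalls}")

    val skipped = new CountingLister
    validateAppend(tree, Seq("year", "month"), skipPartitionCheck = true, skipped)
    println(s"list calls with check skipped: ${skipped.listCalls}")
  }
}
```

In this model the number of list requests grows with the number of partition folders, while the skipped path issues none. The trade-off is that skipping the check also skips the validation that the appended data uses the same partition columns as the existing data.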