[ 
https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zaisheng Dai updated SPARK-30474:
---------------------------------
    Description: 
In the current Spark implementation, if you set 
spark.sql.sources.partitionOverwriteMode=dynamic, then even with 
mapreduce.fileoutputcommitter.algorithm.version=2, Spark still renames the 
partition folders *sequentially* in the commitJob stage, as shown here: 

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
 
[https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184]

 

This is very slow on cloud storage, where a rename is typically a copy plus a 
delete. Should we commit the data in parallel, similar to FileOutputCommitter 
algorithm v2?
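For reference, the setup described above can be reproduced with a sketch like the following. This is a minimal illustration only: the application name, output path, and columns are hypothetical, and it assumes a Spark 2.4.x build with an S3A-capable Hadoop client on the classpath.

```scala
// Sketch only: hypothetical path and columns; assumes Spark 2.4.x.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DynamicPartitionOverwriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite-demo")
      // Overwrite only the partitions present in the incoming data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // v2 moves task output during task commit, but the dynamic-overwrite
      // path in HadoopMapReduceCommitProtocol still renames the staged
      // partition directories one by one inside commitJob.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    val df = spark.range(0, 1000).withColumn("part", col("id") % 10)

    // With many partitions on an object store, commitJob spends most of its
    // time in the sequential rename loop linked above.
    df.write
      .mode("overwrite")
      .partitionBy("part")
      .parquet("s3a://my-bucket/demo-table") // hypothetical path
  }
}
```

With ten partition values as above the overhead is modest, but with thousands of dynamic partitions the per-directory renames in commitJob dominate the job's wall-clock time on stores where rename is not atomic.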

 

  was:
In the current Spark implementation, if you set 
spark.sql.sources.partitionOverwriteMode=dynamic, then even with 
mapreduce.fileoutputcommitter.algorithm.version=2, Spark still renames the 
partition folders *sequentially* in the commitJob stage, as shown here: 

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]

 

This is very slow on cloud storage, where a rename is typically a copy plus a 
delete. Should we commit the data in parallel, similar to FileOutputCommitter 
algorithm v2?

 


> Writing data to parquet with dynamic partition should not be done in commit 
> job stage
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-30474
>                 URL: https://issues.apache.org/jira/browse/SPARK-30474
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: Zaisheng Dai
>            Priority: Minor
>
> In the current Spark implementation, if you set 
> spark.sql.sources.partitionOverwriteMode=dynamic, then even with 
> mapreduce.fileoutputcommitter.algorithm.version=2, Spark still renames the 
> partition folders *sequentially* in the commitJob stage, as shown here: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
>  
> [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184]
>  
> This is very slow on cloud storage, where a rename is typically a copy plus 
> a delete. Should we commit the data in parallel, similar to 
> FileOutputCommitter algorithm v2?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
