Zaisheng Dai created SPARK-30474:
------------------------------------

             Summary: Writing data to parquet with dynamic partition should not 
be done in commit job stage
                 Key: SPARK-30474
                 URL: https://issues.apache.org/jira/browse/SPARK-30474
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.4.4, 2.3.4
            Reporter: Zaisheng Dai


In the current Spark implementation, if you set 
spark.sql.sources.partitionOverwriteMode=dynamic, then even with 
mapreduce.fileoutputcommitter.algorithm.version=2 the partition folders are 
still renamed *sequentially* in the commitJob stage, as shown here: 

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
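
For context, a minimal sketch of the sequential loop that the linked line sits 
in; the method name, signature, and fields (fs, stagingDir, partitionPaths) are 
simplified stand-ins, not the exact Spark code:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Simplified sketch of the dynamic-partition-overwrite commit path in
// HadoopMapReduceCommitProtocol.commitJob; see the linked source for
// the real implementation.
def commitDynamicPartitions(
    fs: FileSystem,
    stagingDir: Path,
    outputDir: Path,
    partitionPaths: Set[String]): Unit = {
  for (part <- partitionPaths) {   // partitions are moved one at a time
    val finalPartPath = new Path(outputDir, part)
    // Drop any existing data for the partition being overwritten.
    fs.delete(finalPartPath, true)
    // One blocking rename per partition; on cloud object stores each
    // rename is a copy + delete, so this loop dominates commit time.
    fs.rename(new Path(stagingDir, part), finalPartPath)
  }
}
{code}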

 

These sequential renames are very slow on cloud object stores, where a rename 
is typically a copy followed by a delete rather than an atomic metadata 
operation. Should we commit the data in a manner similar to 
FileOutputCommitter v2, where each task's output reaches its final location at 
task commit time instead of being moved in commitJob?
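
A v2-style redesign would move files at commitTask time so that commitJob has 
nothing left to rename. Short of that, even issuing the renames concurrently 
in commitJob would help on object stores. A minimal sketch of that parallel 
variant, reusing the simplified signature above and assuming a hypothetical 
parallelism knob (this is not Spark's actual code, just an illustration):

{code:scala}
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical mitigation sketch: perform the per-partition renames
// concurrently instead of one at a time. `parallelism` is an assumed
// tuning knob, not an existing Spark configuration.
def commitDynamicPartitionsParallel(
    fs: FileSystem,
    stagingDir: Path,
    outputDir: Path,
    partitionPaths: Set[String],
    parallelism: Int = 16): Unit = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val renames = partitionPaths.toSeq.map { part =>
      Future {
        val finalPartPath = new Path(outputDir, part)
        fs.delete(finalPartPath, true)
        fs.rename(new Path(stagingDir, part), finalPartPath)
      }
    }
    // Block until all renames finish, propagating the first failure.
    Await.result(Future.sequence(renames), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}
{code}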

 


