Zaisheng Dai created SPARK-30474:
------------------------------------
Summary: Writing data to parquet with dynamic partition should not
be done in commit job stage
Key: SPARK-30474
URL: https://issues.apache.org/jira/browse/SPARK-30474
Project: Spark
Issue Type: Improvement
Components: Input/Output
Affects Versions: 2.4.4, 2.3.4
Reporter: Zaisheng Dai
In the current Spark implementation, if you set
spark.sql.sources.partitionOverwriteMode=dynamic, then even with
mapreduce.fileoutputcommitter.algorithm.version=2 the commitJob stage still
renames the partition directories *sequentially* on the driver, as shown here:
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
This is very slow on cloud storage, where a rename is typically a copy followed
by a delete. Should we instead commit the data similarly to
FileOutputCommitter v2, where each task commits its output directly to the
final location?
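To make the cost concrete, here is a minimal illustrative sketch (plain Python, not Spark code; the function names and the thread-pool approach are assumptions for illustration only). It models the driver-side commitJob loop that moves each staged partition directory to its final location one at a time, next to a variant that issues the moves concurrently, which is one way the per-rename latency of an object store could be amortized:

```python
# Illustrative sketch only -- NOT the Spark implementation. It models the
# shape of the problem: commitJob renaming staged dynamic-partition
# directories one by one on the driver, vs. issuing the renames concurrently.
from concurrent.futures import ThreadPoolExecutor

def rename(src, dst, log):
    # Stand-in for fs.rename(src, dst); on cloud storage each such call is
    # high-latency (often a copy plus a delete rather than a metadata move).
    log.append((src, dst))

def commit_job_sequential(partitions, log):
    # Current behavior modeled here: one rename per partition, in a plain
    # driver-side loop, so total time grows linearly with partition count.
    for p in partitions:
        rename(f".spark-staging/{p}", p, log)

def commit_job_parallel(partitions, log):
    # A possible improvement (hypothetical): issue the renames concurrently
    # so the per-call latency overlaps instead of accumulating.
    with ThreadPoolExecutor(max_workers=8) as ex:
        list(ex.map(lambda p: rename(f".spark-staging/{p}", p, log), partitions))

parts = [f"dt=2020-01-{d:02d}" for d in range(1, 11)]
seq_log, par_log = [], []
commit_job_sequential(parts, seq_log)
commit_job_parallel(parts, par_log)
```

Both variants perform the same set of moves; only the latency profile differs. FileOutputCommitter v2 sidesteps the loop entirely by having each task promote its own output at task-commit time, at the cost of weaker atomicity on job failure.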
--
This message was sent by Atlassian Jira
(v8.3.4#803005)