turboFei commented on issue #26159: [SPARK-29506][SQL] Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table
URL: https://github.com/apache/spark/pull/26159#issuecomment-544333629

> For hive table insertion, we insert to a fresh staging dir first. So dynamicPartitionOverwrite and normal write are logically the same IIUC. Do you mean dynamicPartitionOverwrite is better for performance?

Hi @cloud-fan, yes. I think dynamicPartitionOverwrite keeps a `filesToMove` set, so the number of rename operations equals the number of partitions:

https://github.com/apache/spark/blob/f4d5aa42139ff8412c573c96a1631ef3ccf81844/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L181-L183

With a normal write (using file output committer algorithm version 1), each task commits its output to a temporary path, for example:

```
_temporary/0/task_attempt_1/p1=v1/p2=v2/task1.parquet
_temporary/0/task_attempt_2/p1=v1/p2=v2/task2.parquet
_temporary/0/task_attempt_3/p1=v1/p2=v2/task3.parquet
```

After all tasks have completed, job commit invokes `mergePaths` to merge these outputs into the final location. For a partitioned table this cost is larger than with dynamicPartitionOverwrite.

There is, however, a known issue with dynamicPartitionOverwrite: a task may conflict with its speculative task. I have created a PR to address it (https://github.com/apache/spark/pull/26086); can you help take a look?
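To make the cost difference concrete, here is a minimal, self-contained Scala sketch (not Spark's actual code; `TaskOutput`, `dynamicOverwriteRenames`, and `v1CommitterRenames` are hypothetical names for illustration). It models the claim above: dynamicPartitionOverwrite performs roughly one rename per distinct partition, while the v1 committer's `mergePaths` pass touches every task's output file.

```scala
// Hypothetical cost model of the two commit strategies discussed above.
// Assumption: rename count is the dominant commit-time cost on the filesystem.
object CommitCostSketch {
  // One row per file a task wrote, e.g.
  // "_temporary/0/task_attempt_1" + "p1=v1/p2=v2" + "task1.parquet"
  final case class TaskOutput(attemptDir: String,
                              partitionPath: String,
                              fileName: String)

  // dynamicPartitionOverwrite: the protocol tracks staged files (the
  // filesToMove set) and moves them grouped by partition, so the rename
  // count scales with the number of distinct partitions.
  def dynamicOverwriteRenames(outputs: Seq[TaskOutput]): Int =
    outputs.map(_.partitionPath).distinct.size

  // FileOutputCommitter algorithm v1: commitJob walks every committed task
  // attempt directory and merges each file into the final location, so the
  // rename count scales with the number of files.
  def v1CommitterRenames(outputs: Seq[TaskOutput]): Int =
    outputs.size

  def main(args: Array[String]): Unit = {
    // The three example files from the comment: 3 tasks, 1 partition.
    val outputs = Seq(
      TaskOutput("_temporary/0/task_attempt_1", "p1=v1/p2=v2", "task1.parquet"),
      TaskOutput("_temporary/0/task_attempt_2", "p1=v1/p2=v2", "task2.parquet"),
      TaskOutput("_temporary/0/task_attempt_3", "p1=v1/p2=v2", "task3.parquet")
    )
    println(s"dynamicPartitionOverwrite renames: ${dynamicOverwriteRenames(outputs)}")
    println(s"v1 committer renames: ${v1CommitterRenames(outputs)}")
  }
}
```

With the three files above in a single partition, the sketch counts 1 rename for dynamicPartitionOverwrite versus 3 for the v1 committer; the gap widens as the file count per partition grows.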
