turboFei commented on issue #26159: [SPARK-29506][SQL] Use 
dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table
URL: https://github.com/apache/spark/pull/26159#issuecomment-544333629
 
 
   > For hive table insertion, we insert to a fresh staging dir first. So 
dynamicPartitionOverwrite and normal write are logically the same IIUC. Do you 
mean dynamicPartitionOverwrite is better for performance?
   
   Hi @cloud-fan, I think dynamicPartitionOverwrite keeps a filesToMove set, and the number of rename operations equals the number of dynamic partitions:
   
https://github.com/apache/spark/blob/f4d5aa42139ff8412c573c96a1631ef3ccf81844/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L181-L183
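To make the cost model concrete, here is a toy sketch (in Python, not Spark's actual Scala code; the names `commit_job_dynamic_overwrite` and `staged` are made up for illustration) of the key point: on job commit, each staged partition directory is renamed into place exactly once, so renames scale with the partition count, not the file count.

```python
# Toy model of dynamicPartitionOverwrite job commit (illustrative only,
# not Spark's implementation): tasks record which dynamic partitions they
# wrote, and the driver renames each staged partition directory once.

def commit_job_dynamic_overwrite(staged):
    """staged: {partition_path: [file, ...]} collected from all tasks."""
    renames = []
    for partition in staged:  # one rename per partition, however many files
        renames.append((f".spark-staging/{partition}", partition))
    return renames

staged = {
    "p1=v1/p2=v2": ["task1.parquet", "task2.parquet"],
    "p1=v1/p2=v3": ["task3.parquet"],
}
# 3 files across 2 partitions, but only 2 renames at job commit
print(len(commit_job_dynamic_overwrite(staged)))  # 2
```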
   
   With a normal write (assuming file output committer algorithm version 1), each task commits its output to a temporary path.
   For example:
   _temporary/0/task_attempt_1/p1=v1/p2=v2/task1.parquet
   _temporary/0/task_attempt_2/p1=v1/p2=v2/task2.parquet
   _temporary/0/task_attempt_3/p1=v1/p2=v2/task3.parquet
   
   After all tasks have completed, it invokes mergePaths to merge these outputs file by file, so for a partitioned table the cost is larger than with dynamicPartitionOverwrite.
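The contrast can be sketched the same way (again a toy Python model, not Hadoop's actual mergePaths; `merge_paths` and `task_outputs` are hypothetical names): the v1 commit walks every committed task attempt directory and moves each output file individually, so the work scales with the total number of files rather than the number of partitions.

```python
# Toy model of FileOutputCommitter algorithm v1 job commit (illustrative
# only): mergePaths-style merging moves every file from every committed
# task attempt directory under _temporary/0 into the final location.

def merge_paths(task_outputs):
    """task_outputs: {task_dir: {partition: [files]}} under _temporary/0."""
    moves = []
    for task_dir, parts in task_outputs.items():
        for partition, files in parts.items():
            for f in files:  # one move per output file
                moves.append((f"_temporary/0/{task_dir}/{partition}/{f}",
                              f"{partition}/{f}"))
    return moves

outputs = {
    "task_attempt_1": {"p1=v1/p2=v2": ["task1.parquet"]},
    "task_attempt_2": {"p1=v1/p2=v2": ["task2.parquet"]},
    "task_attempt_3": {"p1=v1/p2=v2": ["task3.parquet"]},
}
# 3 moves for 3 files, even though they all land in a single partition
print(len(merge_paths(outputs)))  # 3
```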
   
   But there is a known issue with dynamicPartitionOverwrite: a task may conflict with its speculative attempt.
   I have created a PR, https://github.com/apache/spark/pull/26086, could you help take a look?
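The hazard can be illustrated with a minimal sketch (hypothetical `target_path` helper, not Spark code): because the staged file name is derived from the task id without the attempt id, the original and speculative attempts of the same task target the same path, so one attempt's write or cleanup can clobber the other's.

```python
# Hypothetical illustration of the speculation conflict under dynamic
# partition overwrite: two attempts of the same task derive the same
# staged file name, so they race on the same path.

def target_path(partition, task_id):
    # attempt id is NOT part of the name, unlike the _temporary/0 layout
    return f"{partition}/part-{task_id:05d}.parquet"

original = target_path("p1=v1/p2=v2", 1)
speculative = target_path("p1=v1/p2=v2", 1)
print(original == speculative)  # True: the two attempts collide
```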
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
