[GitHub] [spark] Ngone51 commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

GitBox Tue, 04 Aug 2020 02:22:46 -0700


Ngone51 commented on a change in pull request #29000:
URL: https://github.com/apache/spark/pull/29000#discussion_r464917882




##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -41,13 +41,23 @@ import org.apache.spark.mapred.SparkHadoopMapRedUtil
  * @param jobId the job's or stage's id
  * @param path the job's output path, or null if committer acts as a noop
  * @param dynamicPartitionOverwrite If true, Spark will overwrite partition 
directories at runtime
- *                                  dynamically, i.e., we first write files 
under a staging
- *                                  directory with partition path, e.g.
- *                                  /path/to/staging/a=1/b=1/xxx.parquet. When 
committing the job,
- *                                  we first clean up the corresponding 
partition directories at
- *                                  destination path, e.g. 
/path/to/destination/a=1/b=1, and move
- *                                  files from staging directory to the 
corresponding partition
- *                                  directories under destination path.
+ *                                  dynamically, i.e., we first write files to 
task attempt paths
+ *                                  under a staging directory, e.g.
+ *                                  
/path/to/outputPath/.spark-staging-{jobId}/_temporary/
+ *                                  
{appAttemptId}/_temporary/{taskAttemptId}/a=1/b=1/xxx.parquet.
+ *                                  1. When [[FileOutputCommitter]] algorithm 
version set to 1,
+ *                                  we firstly move files from task attempt
+ *                                  paths to corresponding partition 
directories under the staging
+ *                                  directory during committing job, e.g.
+ *                                  
/path/to/outputPath/.spark-staging-{jobId}/a=1/b=1.
+ *                                  Secondly, move the partition directories 
under staging
+ *                                  directory to destination path, e.g. 
/path/to/outputPath/a=1/b=1
+ *                                  2. When [[FileOutputCommitter]] algorithm 
version set to 2,
+ *                                  committing tasks directly move files to 
staging directory,
+ *                                  e.g. 
/path/to/outputPath/.spark-staging-{jobId}/a=1/b=1.
+ *                                  Then move this partition directories under 
staging directory
+ *                                  to destination path during job committing, 
e.g.
+ *                                  /path/to/outputPath/a=1/b=1

Review comment:
       Thanks for your detailed explanation. But since these are underlying 
details rather than implemented by Spark, I think it's better to simplify it. 
e.g., we first..., then move ... from ... to... and move to ... at the end 
during the job committing.
   
   WDYT?  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] Ngone51 commented on a change in pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode

Reply via email to