turboFei commented on issue #25863: [WIP][SPARK-29037][CORE][SQL] For static 
partition overwrite, spark may give duplicate result.
URL: https://github.com/apache/spark/pull/25863#issuecomment-534679302
 
 
   > I will try to set our own commitJob method to FileOutputCommitter by using 
reflection to fix the UT issue.
   
   For the UT issue, there is a subclass of ParquetOutputCommitter that 
overrides the commitJob method.
   So I cannot pass the UT with our own commitJob method.
   
   I have tried setting the committer's field named "outputPath" to 
`tablePath` after the FileOutputCommitter has been initialized.
   But the `getAllCommittedTaskPaths` method uses outputPath to locate all 
committed task paths.
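   The reflection attempt described above can be sketched as follows. This is 
a minimal illustration only: `DemoCommitter` is a hypothetical stand-in for 
Hadoop's FileOutputCommitter (so the snippet is self-contained), and 
`setOutputPath` is an assumed helper name; the private field name 
"outputPath" matches the real class.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for Hadoop's FileOutputCommitter, which keeps its
// destination in a private "outputPath" field.
class DemoCommitter {
    private String outputPath;

    DemoCommitter(String path) {
        this.outputPath = path;
    }

    String getOutputPath() {
        return outputPath;
    }
}

public class ReflectionSketch {
    // Overwrite the committer's private outputPath after construction.
    // The problem noted above: getAllCommittedTaskPaths in the real
    // FileOutputCommitter also reads outputPath, so redirecting it to the
    // table path breaks the lookup of committed task paths under staging.
    static void setOutputPath(Object committer, String newPath) throws Exception {
        Field f = committer.getClass().getDeclaredField("outputPath");
        f.setAccessible(true);
        f.set(committer, newPath);
    }

    public static void main(String[] args) throws Exception {
        DemoCommitter c = new DemoCommitter("/staging/dir");
        setOutputPath(c, "/warehouse/table");
        System.out.println(c.getOutputPath());
    }
}
```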
   
   So there is no perfect solution.
   
   I think we can set `mapreduce.fileoutputcommitter.algorithm.version` to 2 
implicitly for the InsertIntoHadoopFsRelation operation, which would commit 
task output directly to the stagingOutputPath.
   Then we can merge the output under the stagingOutputPath into the tablePath.
   The cost is the same as the previous implementation when 
`mapreduce.fileoutputcommitter.algorithm.version` is set to 1, and it also 
would not produce a partial result.
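   For reference, the setting in question corresponds to the following 
user-facing configuration (a spark-defaults.conf style fragment; the proposal 
above is to apply it implicitly in code rather than ask users to set it):

```
# v2 commits task output directly to the job output path, skipping the
# per-job rename in commitJob; the PR would then merge the staging output
# into the table path.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```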

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
