turboFei commented on issue #25863: [WIP][SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. URL: https://github.com/apache/spark/pull/25863#issuecomment-534679302

> I will try to set our own commitJob method to FileOutputCommitter by using reflection to fix the UT issue.

Regarding the UT issue: there are subclasses of ParquetOutputCommitter that override the commitJob method, so I cannot make the UT pass with our own commitJob method.

I also tried setting `tablePath` on the committer field named "outputPath" when the FileOutputCommitter is initialized, but the `getAllCommittedTaskPaths` method uses outputPath to collect all committed task paths. So neither approach is a perfect solution.

I think we can set `mapreduce.fileoutputcommitter.algorithm.version` to 2 implicitly for the InsertIntoHadoopFsRelation operation, which commits task output directly to stagingOutputPath. Then we can merge the output under stagingOutputPath into tablePath. The cost is the same as the previous implementation when `mapreduce.fileoutputcommitter.algorithm.version` is set to 1, and it also would not produce partial results.
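To make the proposed merge step concrete, here is a minimal sketch (not the actual Spark/Hadoop code; the helper name and paths are hypothetical, and Python is used only for illustration) of moving committed task output from a staging directory into the final table directory while preserving the partition layout:

```python
import os
import shutil
import tempfile

def merge_staging_to_table(staging_path, table_path):
    """Hypothetical sketch: move every committed output file from the
    staging directory into the table directory, keeping the relative
    (partition) directory structure intact."""
    for root, _dirs, files in os.walk(staging_path):
        rel = os.path.relpath(root, staging_path)
        dest_dir = table_path if rel == "." else os.path.join(table_path, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            shutil.move(os.path.join(root, name), os.path.join(dest_dir, name))

# Demo with temporary directories and a fake partition/file layout.
staging = tempfile.mkdtemp()
table = tempfile.mkdtemp()
os.makedirs(os.path.join(staging, "dt=2019-09-24"))
with open(os.path.join(staging, "dt=2019-09-24", "part-00000"), "w") as f:
    f.write("data")

merge_staging_to_table(staging, table)
print(os.path.exists(os.path.join(table, "dt=2019-09-24", "part-00000")))  # True
```

Because the whole merge happens on the driver side after all tasks have committed, a failure before the merge leaves the table directory untouched, which is why this avoids partial results.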
