turboFei commented on issue #26159: [SPARK-29506][SQL] Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table URL: https://github.com/apache/spark/pull/26159#issuecomment-547679404

@rezasafi @viirya Regarding the issue you mentioned: the root cause is that the output filename contains only the task ID, not the attempt ID. We also hit this when Spark speculation is enabled, because a task's output filename conflicts with that of its speculative attempt.

https://github.com/apache/spark/blob/077fb99a26a9e92104503fade25c0a095fec5e5d/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L108-L118

As the code shows, I think the same risk exists when the output committer is not a FileOutputCommitter, but I have not hit that case and I believe it is rare. Having read the comments under PR #24142, there does seem to be a risk in the non-FileOutputCommitter case: if a task is aborted and fails to clean up its output, the result can be duplicated. I opened PR #26086, which fixes the issue for dynamic partition overwrite only.
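To make the collision concrete, here is a minimal sketch (not Spark's actual implementation; the function names and the `-attempt` suffix are hypothetical) of filename generation with only the task's split/job IDs versus with the attempt ID included:

```python
# Hypothetical sketch: why a filename built from only the split and job IDs
# collides between a task and its speculative attempt, while including the
# attempt ID disambiguates them. Not Spark's real HadoopMapReduceCommitProtocol.

def filename_task_only(split: int, job_id: str, ext: str = "") -> str:
    # No attempt ID: both attempts of the same task map to the same name.
    return f"part-{split:05d}-{job_id}{ext}"

def filename_with_attempt(split: int, job_id: str, attempt: int,
                          ext: str = "") -> str:
    # Hypothetical variant: appending the attempt ID makes names unique
    # across a task and its speculative copy.
    return f"part-{split:05d}-{job_id}-attempt{attempt}{ext}"

# A task (attempt 0) and its speculative copy (attempt 1), both for split 3:
a = filename_task_only(3, "job-001")
b = filename_task_only(3, "job-001")
print(a == b)   # True: both attempts target the same file -> conflict

c = filename_with_attempt(3, "job-001", attempt=0)
d = filename_with_attempt(3, "job-001", attempt=1)
print(c == d)   # False: distinct names -> no conflict
```

This is only the naming half of the story; FileOutputCommitter additionally stages each attempt's output in a per-attempt directory, which is why the problem mainly surfaces for committers that don't do that staging.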
