Github user zheh12 commented on the issue: https://github.com/apache/spark/pull/21286

I think the Hadoop design does not allow two jobs to share the same output directory. Hadoop has a related patch that partially addresses this: you can configure it not to clean up the `_temporary` directory. But I don't think this is a good solution. [MAPREDUCE-6478. Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob.](https://issues.apache.org/jira/browse/MAPREDUCE-6478?attachmentSortBy=fileName)

For this problem, we'd better use a different temporary output directory for each job and then copy the files into place. However, the current implementation breaks some unit tests. There are two ways to fix it:

1. Add a check for the presence of the temp directory in `HadoopMapReduceCommitProtocol.commitJob`, but this requires the caller to externally set `FileOutputFormat.setOutputPath(job, s".temp-${committer.getJobId()}")`.
2. Enable the per-job temp directory for all `HadoopMapReduceCommitProtocol` jobs. This hides the temp-directory setup from callers, but every job pays for one extra file move.

cc @cloud-fan. Which do you think is better? Please give me some advice.
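To make the idea concrete, here is a minimal, hypothetical sketch (plain `java.nio.file`, not Spark's actual `HadoopMapReduceCommitProtocol` or Hadoop's committer API): each job writes under its own `.temp-<jobId>` directory, and `commitJob` first checks that the directory exists before moving its files into the shared output directory, so concurrent jobs never collide on a shared `_temporary`.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of the per-job temp directory scheme.
// Names (PerJobTempCommit, writeTask, commitJob) are invented for this sketch.
public class PerJobTempCommit {
    static Path tempDir(Path output, String jobId) {
        // Each job gets its own staging directory under the shared output path.
        return output.resolve(".temp-" + jobId);
    }

    static void writeTask(Path output, String jobId, String file, String data)
            throws IOException {
        Path dir = tempDir(output, jobId);
        Files.createDirectories(dir);
        Files.write(dir.resolve(file), data.getBytes("UTF-8"));
    }

    static void commitJob(Path output, String jobId) throws IOException {
        Path dir = tempDir(output, jobId);
        if (!Files.exists(dir)) return; // presence check, as in option 1
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                // The "one more files move" cost mentioned for option 2.
                Files.move(f, output.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.delete(dir); // staging dir is empty after the moves
    }
}
```

With this layout, two jobs with different job IDs can stage output under the same parent directory concurrently and commit independently.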