Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    I think Hadoop's design does not allow two jobs to share the same output
directory.
    
    Hadoop has a related patch that partially addresses this problem: you can
configure it not to clean up the `_temporary` directory. But I don't think
that is a good solution.
    
    [MAPREDUCE-6478. Add an option to skip cleanupJob stage or ignore cleanup 
failure during 
commitJob.](https://issues.apache.org/jira/browse/MAPREDUCE-6478?attachmentSortBy=fileName)
 
    
    For this problem, we'd better use a different temporary output directory
for each job, and then move the files into the final output directory.
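    To illustrate the idea, here is a minimal sketch using local `java.nio` file operations in place of Hadoop's `FileSystem` API; the directory layout, file names, and the `commitJob` helper are hypothetical, not the actual Spark/Hadoop implementation:

```java
import java.io.IOException;
import java.nio.file.*;

public class PerJobTempDirDemo {
    // Hypothetical sketch: each job stages output in its own temp dir under
    // the shared output root, then moves its files into place on commit.
    static Path commitJob(Path outputRoot, String jobId, byte[] data) throws IOException {
        Path tempDir = outputRoot.resolve(".temp-" + jobId);
        Files.createDirectories(tempDir);
        Path staged = tempDir.resolve("part-00000");
        Files.write(staged, data);
        // Commit: move the file from the job-private temp dir to the output root.
        Path committed = outputRoot.resolve("part-" + jobId);
        Files.move(staged, committed, StandardCopyOption.ATOMIC_MOVE);
        // Clean up only this job's temp dir; other jobs' dirs are untouched.
        Files.delete(tempDir);
        return committed;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("output");
        // Two "jobs" share the same output root without clobbering each other.
        Path a = commitJob(root, "jobA", "from-job-a".getBytes());
        Path b = commitJob(root, "jobB", "from-job-b".getBytes());
        System.out.println(Files.exists(a) && Files.exists(b)); // prints true
    }
}
```

    Because each job only ever deletes its own `.temp-<jobId>` directory, concurrent jobs no longer race on a shared `_temporary` directory.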
    
    However, the current implementation breaks some unit tests. There are two 
ways to fix it.
    
    1. Check for the presence of tempDir in
`HadoopMapReduceCommitProtocol.commitJob`, but this requires the output path
to be set externally, e.g.
`FileOutputFormat.setOutputPath(job, s".temp-${committer.getJobId()}")`.
    
    2. Alternatively, enable the tempDir directory for every
`HadoopMapReduceCommitProtocol`.
      This hides the tempDir setup from callers, but every job then incurs
one extra file move.
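    A hedged sketch of what the check in option 1 might look like; the class shape, field names, and commit logic below are hypothetical stand-ins, not Spark's actual `HadoopMapReduceCommitProtocol`:

```java
import java.io.IOException;
import java.nio.file.*;

public class CommitProtocolSketch {
    private final Path tempDir;     // per-job temp output, e.g. ".temp-<jobId>"
    private final Path finalOutput; // shared final output directory

    CommitProtocolSketch(Path tempDir, Path finalOutput) {
        this.tempDir = tempDir;
        this.finalOutput = finalOutput;
    }

    // Option 1: commitJob moves files only if the per-job temp dir exists,
    // so jobs configured without a temp dir keep their old behavior.
    void commitJob() throws IOException {
        if (!Files.isDirectory(tempDir)) {
            return; // nothing was staged under a per-job temp dir
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
            for (Path f : files) {
                Files.move(f, finalOutput.resolve(f.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.delete(tempDir); // temp dir is empty after the moves
    }
}
```

    The existence check is what keeps the unit tests that never set a per-job output path working unchanged.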
    
    cc @cloud-fan. Which approach do you think is better? Please give me some advice.

