Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/21286
@jinxing64 from my reading of the code, the original patch proposed
creating a temp dir for every query, so each could do its own work & cleanup
in parallel, with a new meta-commit on each job commit moving stuff from the
per-job temp dir into the final dest.
This is to address
* conflict of work in the `_temporary/0` path
* rm of `_temporary` during job abort and post-commit cleanup
And the reason for that '0' is that Spark's job ID is just a counter of
queries run since app start, whereas in Hadoop MR it's unique across a live
YARN cluster. Spark deploys in different ways and can't rely on that value.
The job id discussion proposes generating unique job IDs for every spark
app, so allowing `_temporary/$jobID1` to work alongside `_temporary/$jobID2`.
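To make the failure mode concrete, here is a minimal sketch of why a shared `_temporary/0` staging dir breaks parallel jobs, and how per-job IDs keep them disjoint. It models the layout with plain directories; the names (`setup_attempt`, the file names) are illustrative, not Spark or Hadoop APIs.

```python
# Model the committer staging layout: <dest>/_temporary/<job_id>/<file>.
import os
import shutil
import tempfile

dest = tempfile.mkdtemp()

def setup_attempt(job_id, filename):
    # Each job stages its in-flight output under _temporary/<job_id>/
    attempt_dir = os.path.join(dest, "_temporary", str(job_id))
    os.makedirs(attempt_dir, exist_ok=True)
    with open(os.path.join(attempt_dir, filename), "w") as f:
        f.write("data")
    return attempt_dir

# Shared job id "0": two concurrent queries land in the SAME directory.
a = setup_attempt(0, "part-a")
b = setup_attempt(0, "part-b")
assert a == b  # conflict: their temp data is intermingled

# Cleanup/abort of one job removes the other's in-flight data too.
shutil.rmtree(os.path.join(dest, "_temporary", "0"))

# Unique job IDs keep the staging dirs disjoint.
a = setup_attempt("job-001", "part-a")
b = setup_attempt("job-002", "part-b")
assert a != b
shutil.rmtree(a)  # aborting job-001 ...
assert os.path.exists(os.path.join(b, "part-b"))  # ... leaves job-002 intact
```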
With that *and* disabling cleanup in the `FileOutputCommitter`
(`mapreduce.fileoutputcommitter.cleanup.skipped`), @zheh12 should get what they
need: parallel queries to the same dest using `FileOutputCommitter` without
conflict of temp data.
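One way this might be wired up, as a config sketch: Spark forwards `spark.hadoop.*` keys into the Hadoop `Configuration` the committer sees, so the cleanup-skip option could be set in `spark-defaults.conf`. This assumes the unique-job-ID change above is in place, otherwise skipping cleanup just leaves colliding `_temporary/0` dirs behind.

```
# spark-defaults.conf sketch -- assumes per-job unique IDs so the
# _temporary/<jobID> dirs are disjoint and safe to leave uncleaned
# by the other jobs' commit/abort paths.
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped  true
```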
> Thus the change outside committer and doesn't break commiterr's logic.
Did I understand correctly ?
Exactly. It also makes it a simpler change, which is good as the commit
algorithms are pretty complex and it's hard to test all the failure modes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]