Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/19848
The job ID is only used in the normal FileOutputCommitter to generate unique
paths, using `s"_temporary/${jobId}_$jobAttempt"` for the path (i.e. the
job-attempt ID, which is jobID + attempt).
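As a minimal sketch of that layout, following the description above (the variable names, example values and exact separator are assumptions; the real construction lives in Hadoop's FileOutputCommitter):

```scala
// Sketch only: how a job-attempt working path is built from the job ID.
// Names and values here are illustrative, not the actual committer fields.
val dest       = "/data/output"               // destination directory of the query
val jobId      = "job_1511954528750_0001"     // example Hadoop-style job ID
val jobAttempt = 0                            // attempt counter for the job

// job-attempt ID = jobID + attempt; all pending work sits under _temporary
val jobAttemptDir = s"$dest/_temporary/${jobId}_$jobAttempt"
```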
When trying to recover a job (v1 algorithm only), it works out its current
job ID and attempt, looks for committed task directories in the directory
`${jobId}_${attemptId - 1}`, moves them into the current attempt as completed,
and then sets off to do the remainder. This relies on the rename of task-attempt
directories being atomic, and assumes those renames are O(1).
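A rough sketch of that recovery flow (a hypothetical helper, not the actual FileOutputCommitter code; the path layout follows the description above):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch of v1 job recovery as described above: find the previous
// attempt's committed task directories and rename them into the current attempt.
def recoverPreviousAttempt(fs: FileSystem, dest: Path, jobId: String, attemptId: Int): Unit = {
  val previous = new Path(dest, s"_temporary/${jobId}_${attemptId - 1}")
  val current  = new Path(dest, s"_temporary/${jobId}_$attemptId")
  if (fs.exists(previous)) {
    fs.mkdirs(current)
    fs.listStatus(previous).foreach { committed =>
      // relies on rename being atomic and O(1) on the underlying filesystem
      fs.rename(committed.getPath, new Path(current, committed.getPath.getName))
    }
  }
}
```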
Stocator (ask @gilv) uses the job attempt ID for naming the final files
created; I don't know the implications there, but given it's written for Spark,
you can assume the current numbering scheme works.
I don't know of anything which assumes that jobIDs (and, transitively,
jobAttemptIDs and taskAttemptIDs) are UUIDs. It might be worth specifying that
there is no such guarantee in whatever docs get written up.
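For example (values are hypothetical and formats vary by scheduler), the IDs are structured strings rather than UUIDs:

```scala
// Hypothetical examples of the typical ID shapes; structured, not UUIDs.
val jobId         = "job_1511954528750_0001"                 // cluster timestamp + sequence number
val jobAttemptId  = s"${jobId}_0"                            // jobID + attempt, per the description above
val taskAttemptId = "attempt_1511954528750_0001_m_000003_0"  // task attempt within the job
```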
The main issue with reusing the job ID will be if there is any execution
where more than one job is attempting to write/overwrite data in the same
directory tree (e.g. adding new partitions to an existing dataset, in situ).
That's a pretty dangerous thing to be doing, and given the current
FileOutputCommitter does a `rm $dest/_temporary` at the end of a commit, it's
currently doomed: whichever job commits first blocks the other from committing
(I don't know if that's intentional, or just a side effect of the cleanup
logic). Similarly, the new S3A committers cancel all pending multipart uploads
to a directory, killing outstanding jobs.
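To make that hazard concrete, here's a hedged sketch (hypothetical paths) of why the first job's commit breaks the second:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustration only: two jobs writing to the same destination share one
// pending directory, so the first job's commit-time cleanup removes the
// other job's in-flight task output as well.
val dest    = new Path("/data/table")
val pending = new Path(dest, "_temporary")   // shared by every job writing to dest
val fs      = FileSystem.get(new Configuration())

// ... job A and job B both stage task-attempt output under `pending` ...

// job A commits and then cleans up:
fs.delete(pending, true)   // recursive delete; job B's uncommitted work goes with it
```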
If Spark SQL plans to support simultaneous writes to the same destination
directory, well, more than just the job ID needs to change. So don't worry
about it until then.