Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/19848
The job ID is only used in the normal FileOutputCommitter to generate unique
paths, using `s"_temporary/${jobId}_$jobAttempt"` for the path (i.e. the
job-attempt ID, which is jobID + attempt).
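As a minimal sketch of that layout, following the description above (the variable names, example values and exact separator are assumptions; the real construction lives in Hadoop's FileOutputCommitter):

```scala
// Sketch only: how a job-attempt working path is built from the job ID.
// Names and values here are illustrative, not the actual committer fields.
val dest       = "/data/output"               // destination directory of the query
val jobId      = "job_1511954528750_0001"     // example Hadoop-style job ID
val jobAttempt = 0                            // attempt counter for the job

// job-attempt ID = jobID + attempt; all pending work sits under _temporary
val jobAttemptDir = s"$dest/_temporary/${jobId}_$jobAttempt"
```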
When trying to recover a job (v1 algorithm only), it works out its current
job ID and attempt, looks for committed task directories in the directory
`${jobId}_${attemptId - 1}`, moves them into the current attempt as completed,
and then sets off to do the remainder. This relies on the rename of task-attempt
directories being atomic, and assumes those renames are O(1).
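A rough sketch of that recovery flow (a hypothetical helper, not the actual FileOutputCommitter code; the path layout follows the description above):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch of v1 job recovery as described above: find the previous
// attempt's committed task directories and rename them into the current attempt.
def recoverPreviousAttempt(fs: FileSystem, dest: Path, jobId: String, attemptId: Int): Unit = {
  val previous = new Path(dest, s"_temporary/${jobId}_${attemptId - 1}")
  val current  = new Path(dest, s"_temporary/${jobId}_$attemptId")
  if (fs.exists(previous)) {
    fs.mkdirs(current)
    fs.listStatus(previous).foreach { committed =>
      // relies on rename being atomic and O(1) on the underlying filesystem
      fs.rename(committed.getPath, new Path(current, committed.getPath.getName))
    }
  }
}
```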
Stocator (ask @gilv) uses the job attempt ID for naming the final files
created; I don't know the implications there, but given it's written for Spark,
you can assume the current numbering scheme works.
I don't know of anything which assumes that jobIDs (and, transitively,
jobAttemptIDs and taskAttemptIDs) are UUIDs. It might be worth specifying that
there is no such guarantee in whatever docs get written up.
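For example (values are hypothetical and formats vary by scheduler), the IDs are structured strings rather than UUIDs:

```scala
// Hypothetical examples of the typical ID shapes; structured, not UUIDs.
val jobId         = "job_1511954528750_0001"                 // cluster timestamp + sequence number
val jobAttemptId  = s"${jobId}_0"                            // jobID + attempt, per the description above
val taskAttemptId = "attempt_1511954528750_0001_m_000003_0"  // task attempt within the job
```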
The main issue with reusing the job ID will be if there is any execution
where more than one job is attempting to write/overwrite data in the same
directory tree (e.g. adding new partitions to an existing dataset, in situ).
That's a pretty dangerous thing to be doing, and given the current
FileOutputCommitter does a `rm $dest/_temporary` at the end of a commit, it's
currently doomed: whichever job commits first blocks the other from committing
(I don't know if that's intentional, or just a side effect of the cleanup
logic). Similarly, the new S3A committers cancel all pending multipart uploads
to a directory, killing outstanding jobs.
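To make that hazard concrete, here's a hedged sketch (hypothetical paths) of why the first job's commit breaks the second:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustration only: two jobs writing to the same destination share one
// pending directory, so the first job's commit-time cleanup removes the
// other job's in-flight task output as well.
val dest    = new Path("/data/table")
val pending = new Path(dest, "_temporary")   // shared by every job writing to dest
val fs      = FileSystem.get(new Configuration())

// ... job A and job B both stage task-attempt output under `pending` ...

// job A commits and then cleans up:
fs.delete(pending, true)   // recursive delete; job B's uncommitted work goes with it
```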
If Spark SQL plans to support simultaneous writes to the same destination
directory, well, more than just the job ID needs to change. So don't worry
about it until then.