Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/21286
@jinxing64 from my reading of the code, the original patch proposed
creating a temp dir for every query, so each could do its own work & cleanup
in parallel, with a new meta-commit on each job commit moving stuff from the
per-job temp dir into the final dest.
This is to address
* conflict of work in the `_temporary/0` path
* rm of `_temporary` during job abort and post-commit cleanup
And the reason for that '0' is that Spark's job ID is just a counter of
queries run since app start, whereas in Hadoop MR it's unique across a live
YARN cluster. Spark deploys in different ways and can't rely on that value.
The job id discussion proposes generating unique job IDs for every spark
app, so allowing `_temporary/$jobID1` to work alongside `_temporary/$jobID2`.
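To make the failure mode concrete, here is a minimal sketch of why a shared `_temporary/0` staging dir breaks parallel jobs, and how per-job IDs keep them disjoint. It models the layout with plain directories; the names (`setup_attempt`, the file names) are illustrative, not Spark or Hadoop APIs.

```python
# Model the committer staging layout: <dest>/_temporary/<job_id>/<file>.
import os
import shutil
import tempfile

dest = tempfile.mkdtemp()

def setup_attempt(job_id, filename):
    # Each job stages its in-flight output under _temporary/<job_id>/
    attempt_dir = os.path.join(dest, "_temporary", str(job_id))
    os.makedirs(attempt_dir, exist_ok=True)
    with open(os.path.join(attempt_dir, filename), "w") as f:
        f.write("data")
    return attempt_dir

# Shared job id "0": two concurrent queries land in the SAME directory.
a = setup_attempt(0, "part-a")
b = setup_attempt(0, "part-b")
assert a == b  # conflict: their temp data is intermingled

# Cleanup/abort of one job removes the other's in-flight data too.
shutil.rmtree(os.path.join(dest, "_temporary", "0"))

# Unique job IDs keep the staging dirs disjoint.
a = setup_attempt("job-001", "part-a")
b = setup_attempt("job-002", "part-b")
assert a != b
shutil.rmtree(a)  # aborting job-001 ...
assert os.path.exists(os.path.join(b, "part-b"))  # ... leaves job-002 intact
```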
With that *and* disabling cleanup in the `FileOutputCommitter`
(`mapreduce.fileoutputcommitter.cleanup.skipped`), @zheh12 should get what they
need: parallel queries to the same dest using `FileOutputCommitter` without
conflict of temp data.
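One way this might be wired up, as a config sketch: Spark forwards `spark.hadoop.*` keys into the Hadoop `Configuration` the committer sees, so the cleanup-skip option could be set in `spark-defaults.conf`. This assumes the unique-job-ID change above is in place, otherwise skipping cleanup just leaves colliding `_temporary/0` dirs behind.

```
# spark-defaults.conf sketch -- assumes per-job unique IDs so the
# _temporary/<jobID> dirs are disjoint and safe to leave uncleaned
# by the other jobs' commit/abort paths.
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped  true
```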
> Thus the change outside committer and doesn't break commiterr's logic.
Did I understand correctly ?
Exactly. It also makes it a simpler change, which is good as the commit
algorithms are pretty complex and it's hard to test all the failure modes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]