Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/21286
...this makes me think that the FileOutputCommitter actually has an
assumption that nobody has called out before, specifically "only one
application will be writing data to the target FS with the same job ID". It's
probably been implicit in MR with a local HDFS for a long time, resting first on
the assumption that all jobs get unique job IDs from the same central source
*and* that nothing outside the cluster writes to the same destinations. With cloud
stores, that doesn't hold: it's conceivable that >1 YARN cluster could start
jobs with the same destination. As the timestamp of the YARN launch is used as
the initial part of the identifier, if >1 cluster is launched in the same
minute, things are lined up to collide. Oops.
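To make the collision concrete, here's a minimal sketch against the Hadoop
{{JobID}} API of two clusters minting their first job ID from a
minute-resolution launch timestamp. The format string and the starting
sequence number are assumptions for illustration, not the exact Spark/YARN
code:
{code:java}
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.mapreduce.JobID;

public class JobIdCollision {

  // Assumed scheme: jtIdentifier = cluster launch time at minute resolution,
  // with job sequence numbers starting from 1 in each cluster.
  static JobID firstJobId(Date clusterLaunch) {
    String jtIdentifier = new SimpleDateFormat("yyyyMMddHHmm").format(clusterLaunch);
    return new JobID(jtIdentifier, 1);
  }

  public static void main(String[] args) {
    Date sameMinute = new Date();
    JobID clusterA = firstJobId(sameMinute);
    JobID clusterB = firstJobId(sameMinute); // a second cluster, same minute

    // Both print something like job_201805101342_0001: identical IDs, so
    // their temp/work paths under a shared destination are set to collide.
    System.out.println(clusterA + " vs " + clusterB
        + " -> equal? " + clusterA.equals(clusterB));
  }
}
{code}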
FWIW, the parsing code I mentioned is
{{org.apache.hadoop.mapreduce.JobID.forName()}}: any numbering scheme Spark
uses should be able to map from a string to a job ID through that and back again.
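As a quick illustration of that round-trip constraint, here's a hedged sketch:
whatever ID string gets generated must parse through {{JobID.forName()}} and
render back to the same string via {{toString()}}. The sample value below is
hypothetical:
{code:java}
import org.apache.hadoop.mapreduce.JobID;

public class JobIdRoundTrip {
  public static void main(String[] args) {
    // Hypothetical job ID in the standard job_<jtIdentifier>_<sequence> form.
    String original = "job_201805101342_0001";

    // forName() throws IllegalArgumentException if the string doesn't parse.
    JobID parsed = JobID.forName(original);

    // The round trip must be lossless: toString() reproduces the input.
    System.out.println(parsed.toString().equals(original)); // true
    System.out.println("jtIdentifier=" + parsed.getJtIdentifier()
        + " sequence=" + parsed.getId());
  }
}
{code}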