Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    ...this makes me think that the FileOutputCommitter actually has an 
assumption that nobody has called out before, specifically "only one 
application will be writing data to the target FS with the same job ID". It's 
probably been implicit in MR with a local HDFS for a long time, first on the 
assumption that all jobs get unique job IDs from the same central source 
*and* that nothing outside the cluster writes to the same destinations. With 
cloud stores, that doesn't hold: it's conceivable that more than one YARN 
cluster could start jobs with the same destination. As the timestamp of the 
YARN launch is used as the initial part of the identifier, if two clusters 
were launched in the same minute, things are lined up to collide. Oops.
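    
    To make the collision concrete, here's a minimal sketch (the timestamp 
and sequence numbers are invented for illustration): two clusters whose job 
IDs are seeded from the same launch minute mint identical IDs, and so 
identical temp/commit paths on a shared store.
    
        import org.apache.hadoop.mapreduce.JobID;
        
        public class JobIdCollision {
            public static void main(String[] args) {
                // Hypothetical: two YARN clusters whose RMs started in the
                // same minute, each submitting its first job. The launch
                // timestamp becomes the jtIdentifier; the sequence number
                // restarts at 1 on each cluster.
                JobID clusterA = new JobID("201805101200", 1);
                JobID clusterB = new JobID("201805101200", 1);
        
                // Equal IDs mean equal commit paths on a shared store.
                System.out.println(clusterA + " collides: "
                    + clusterA.equals(clusterB)); // true
            }
        }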
    
    FWIW, the parsing code I mentioned is 
`org.apache.hadoop.mapreduce.JobID.forName()`: any numbering scheme Spark 
uses should be able to map from a string to a job ID through that & back again.
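    
    For example, here's a minimal round-trip sketch (the ID values are again 
invented): whatever scheme Spark picks has to survive forName() parsing and 
re-serialisation unchanged.
    
        import org.apache.hadoop.mapreduce.JobID;
        
        public class JobIdRoundTrip {
            public static void main(String[] args) {
                // Hypothetical job ID: jtIdentifier + sequence number.
                JobID original = new JobID("201805101200", 42);
                String text = original.toString(); // "job_201805101200_0042"
        
                // Parse it back; a scheme forName() can't parse, or one that
                // doesn't re-serialise to the same string, will break MR code
                // that assumes this round trip.
                JobID parsed = JobID.forName(text);
                System.out.println(text + " == " + parsed + " : "
                    + parsed.equals(original)); // true
            }
        }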

