steveloughran commented on pull request #30319:
URL: https://github.com/apache/spark/pull/30319#issuecomment-725368337


   I'm thinking it may be best not only to do this but also to add better 
randomness to the timestamp used when creating the artificial app attempt ID. 
Today, Date() is used, hence the conflict if more than one job is started in 
the same second. The staging committer is most vulnerable to this, but if 
someone uses FileOutputCommitter with the same destination dir *and* overwrite 
enabled, the same conflict occurs.
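   A minimal sketch of the collision (hypothetical class and method names, not 
the actual committer code): a Date()-derived ID only has second granularity, so 
two jobs launched within the same second produce the same ID, whereas a random 
UUID keeps them distinct.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

// Sketch only: illustrates why a second-resolution timestamp makes a
// poor job/attempt identifier. Names here are invented for the example.
public class AttemptIdSketch {

    // Timestamp-based ID, as with new Date(): second resolution only.
    static String timestampAttemptId(Date when) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(when);
    }

    // Randomized ID: jobs starting in the same second still differ.
    static String randomAttemptId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        long t = 1_600_000_000_000L;
        // Two "jobs" launched 500 ms apart collide on the timestamp ID...
        String a = timestampAttemptId(new Date(t));
        String b = timestampAttemptId(new Date(t + 500));
        System.out.println(a.equals(b));                                 // true
        // ...but get distinct random UUIDs.
        System.out.println(randomAttemptId().equals(randomAttemptId())); // false
    }
}
```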
   
   I'm making sure the S3A committers pick up this UUID everywhere (staging 
already does for the cluster FS, but not for the local task attempt dir). What 
I'm not going to go near is the classic FileOutputCommitter, for the following 
reason: fear.
   
   I don't want to touch that committer, as it is far too critical, and it 
contains deep assumptions that application attempt IDs are sequential; Hadoop 
MR relies on that for recoverability of restarted job attempts. Spark doesn't 
worry about failed drivers, so it doesn't need that sequential naming.

