[
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated SPARK-33402:
-----------------------------------
Description:
Spark uses the current timestamp, at one-second resolution, to generate a
MapReduce JobID. If more than one job attempt is created in the same second,
their JobIDs can clash.
Committers which expect the JobID to be unique can then conflict with other
jobs:
* S3A staging committer (cluster FS staging dir and local task output dir)
* Any committer which supports parallel jobs writing to the same destination
directory and requires unique names for the attempts
* Code which uses the JobID as part of its algorithm to generate unique
filenames
Note: {{HadoopMapReduceCommitProtocol.getFilename()}} doesn't use this JobID for
uniqueness; it uses the task attempt ID and stage ID. It probably deserves its
own audit.
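The clash described above can be illustrated with a minimal sketch. This is not Spark's code: {{timestamp_job_id}} and {{staging_dir}} are hypothetical helpers, and the ID layout and staging path are assumptions chosen only to show how a second-resolution timestamp discards the sub-second difference between two jobs.

```python
from datetime import datetime, timedelta


def timestamp_job_id(created: datetime, job_number: int = 0) -> str:
    # Hypothetical helper: formats the creation time at one-second
    # resolution, so any sub-second difference is discarded.
    return f"job_{created.strftime('%Y%m%d%H%M%S')}_{job_number:04d}"


def staging_dir(base: str, job_id: str) -> str:
    # Hypothetical layout: a committer that keys its staging directory
    # on the job ID alone.
    return f"{base}/{job_id}"


first = datetime(2020, 11, 10, 12, 0, 0, 100_000)
second = first + timedelta(milliseconds=500)  # same wall-clock second
later = first + timedelta(seconds=1)          # next wall-clock second

id_a = timestamp_job_id(first)
id_b = timestamp_job_id(second)
id_c = timestamp_job_id(later)

# Two jobs started 500 ms apart get the same ID and the same staging dir.
print(id_a == id_b)
print(staging_dir("/tmp/staging", id_a) == staging_dir("/tmp/staging", id_b))
# A job started a full second later gets a different ID.
print(id_a == id_c)
```

Under these assumptions, the two jobs started in the same second would share a staging directory, which is the kind of conflict the committers listed above are exposed to.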
was:
SPARK-33230 restored setting a unique job ID in a Spark SQL job writing through
the Hadoop output formatters, but saving files from an RDD doesn't set one,
because SparkHadoopWriter doesn't insert the UUID.
Proposed: set the same property.
> Jobs launched in same second have duplicate JobIDs
> --------------------------------------------------
>
> Key: SPARK-33402
> URL: https://issues.apache.org/jira/browse/SPARK-33402
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.0.1, 3.1.0
> Reporter: Steve Loughran
> Priority: Major