[jira] [Updated] (SPARK-33402) Jobs launched in same second have duplicate JobIDs

Steve Loughran (Jira) Wed, 11 Nov 2020 07:22:59 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran updated SPARK-33402:
-----------------------------------
    Description: 
Spark uses the current timestamp to generate a MapReduce JobID.
If more than one job attempt is generated in the same second, their JobIDs can
clash.

Committers which expect the JobID to be unique can then conflict with other jobs:

* S3A staging committer (cluster FS staging dir and local task output dir)
* Any committer which supports parallel jobs writing to the same destination
  directory and requires unique names for the attempts
* Code which uses the jobID as part of its algorithm to generate unique filenames

Note: {{HadoopMapReduceCommitProtocol.getFilename()}} doesn't use this JobID for
uniqueness; it uses the task attempt ID and stage ID. It probably deserves its own
audit.
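
To illustrate the clash described above, here is a minimal sketch in Python. It is not Spark's code: the helper names, the second-resolution timestamp pattern, the ID layout and the staging path are all assumptions made for the example. It only shows that any ID derived from a timestamp truncated to whole seconds is identical for two jobs started within the same second, and that anything derived from that ID collides too.

```python
from datetime import datetime, timedelta


def timestamp_job_id(launch_time: datetime, sequence: int) -> str:
    """Hypothetical helper: build a MapReduce-style job ID from a timestamp.

    The timestamp is truncated to whole seconds (assumed pattern), so
    sub-second differences between launch times are lost.
    """
    tracker_id = launch_time.strftime("%Y%m%d%H%M%S")
    return f"job_{tracker_id}_{sequence:04d}"


def staging_dir(job_id: str) -> str:
    """Hypothetical helper: a staging path keyed only by the job ID."""
    return f"/tmp/staging/{job_id}"


# Two jobs launched 500 ms apart, inside the same wall-clock second.
first = datetime(2020, 11, 11, 7, 22, 59, 100_000)
second = first + timedelta(milliseconds=500)
# A third job launched one full second after the first.
later = first + timedelta(seconds=1)

id_a = timestamp_job_id(first, 0)
id_b = timestamp_job_id(second, 0)
id_c = timestamp_job_id(later, 0)

print(id_a, id_b, "clash:", id_a == id_b)
print(staging_dir(id_a), staging_dir(id_b))
print(id_c, "clash with first:", id_a == id_c)
```

The first two jobs get the same ID and therefore the same staging path, while the job started a second later does not. This is the situation in which a committer that assumes the ID is unique can interfere with another job.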




  was:
SPARK-33230 restored setting a unique job ID in a Spark SQL job writing through
the Hadoop output formatters, but saving files from an RDD doesn't set one,
because SparkHadoopWriter doesn't insert the UUID

Proposed: set the same property


> Jobs launched in same second have duplicate JobIDs
> --------------------------------------------------
>
>                 Key: SPARK-33402
>                 URL: https://issues.apache.org/jira/browse/SPARK-33402
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.0.1, 3.1.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Spark uses the current timestamp to generate a MapReduce JobID.
> If more than one job attempt is generated in the same second, their JobIDs can
> clash.
> Committers which expect the JobID to be unique can then conflict with other jobs:
> * S3A staging committer (cluster FS staging dir and local task output dir)
> * Any committer which supports parallel jobs writing to the same destination
>   directory and requires unique names for the attempts
> * Code which uses the jobID as part of its algorithm to generate unique filenames
> Note: {{HadoopMapReduceCommitProtocol.getFilename()}} doesn't use this JobID for
> uniqueness; it uses the task attempt ID and stage ID. It probably deserves its own
> audit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

