steveloughran opened a new pull request #30319:
URL: https://github.com/apache/spark/pull/30319


   
   HADOOP-17332. S3A MarkerTool -min and -max are inverted
   https://github.com/apache/hadoop/pull/2425
   
   
   ### What changes were proposed in this pull request?
   
   Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that 
`rdd.saveAsNewAPIHadoopDataset` passes in a unique job ID in 
`spark.sql.sources.writeJobUUID`.
   
   ### Why are the changes needed?
   
   Without this, if more than one job is started in the same second *and the 
committer expects application attempt IDs to be unique*, a job is at risk of 
clashing with other jobs.
   
   With the fix, those committers which give priority to the ID set in 
`spark.sql.sources.writeJobUUID` will pick that up instead, so their job IDs 
will be unique.
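
As an illustration of the clash, here is a minimal sketch, not Spark's actual code. It assumes the ID is derived from the job start time at one-second granularity; the `%Y%m%d%H%M%S` pattern and the helper names `timestamp_job_id` and `uuid_job_id` are hypothetical and chosen only for this example.

```python
import uuid
from datetime import datetime


def timestamp_job_id(started: datetime) -> str:
    """Hypothetical ID built from the start time, truncated to whole seconds."""
    return started.strftime("%Y%m%d%H%M%S")


def uuid_job_id() -> str:
    """Hypothetical ID built from a random UUID instead of the clock."""
    return str(uuid.uuid4())


# Two jobs started within the same wall-clock second.
first = datetime(2020, 11, 10, 12, 0, 0, 100_000)
second = datetime(2020, 11, 10, 12, 0, 0, 900_000)

# Both timestamps collapse to the same string, so the set holds one entry.
timestamp_ids = {timestamp_job_id(first), timestamp_job_id(second)}
print(len(timestamp_ids))

# Random UUIDs differ, so the set holds two entries.
uuid_ids = {uuid_job_id(), uuid_job_id()}
print(len(uuid_ids))
```

Any committer that keys its working state on the timestamp-style ID would see both jobs as the same job, which is the risk described above.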
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   1. Hadoop-trunk built with 
[HADOOP-17318](https://github.com/apache/hadoop/pull/2399), publishing to local 
mvn repo
   1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs. For 
reasons I don't understand, spark master wouldn't build with that hadoop 
version, but an internal spark 2.x branch was happy. 
   1. Spark + Object store integration tests at 
[https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration)
 were built against that local spark version
   1. And executed against AWS London.
   
   The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a 
committers fail fast if no job ID is passed down to them. This showed that 
`rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It still uses 
the current Date value for the app attempt, which is not guaranteed to be unique.
   
   With the change applied to spark, the relevant tests pass, which shows that 
the committers are getting unique job IDs.
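
The fail-fast behaviour the tests rely on can be modelled roughly as follows. This is a sketch under stated assumptions, not the S3A committer implementation: the configuration is represented as a plain dict, and `resolve_job_id` is a hypothetical helper. Only the two option names come from the description above.

```python
JOB_UUID_KEY = "spark.sql.sources.writeJobUUID"
REQUIRE_UUID_KEY = "fs.s3a.committer.require.uuid"


def resolve_job_id(conf: dict, app_attempt_id: str) -> str:
    """Hypothetical model of how a committer might choose its job ID."""
    job_uuid = conf.get(JOB_UUID_KEY, "")
    if job_uuid:
        # A UUID passed down by Spark takes priority.
        return job_uuid
    if conf.get(REQUIRE_UUID_KEY, "false").lower() == "true":
        # Fail fast rather than fall back to an ID that may not be unique.
        raise ValueError(f"{JOB_UUID_KEY} is not set")
    # Fallback: the app attempt ID, which may clash with other jobs.
    return app_attempt_id
```

With the require option set and no UUID in the configuration, this model raises immediately, which is how a missing UUID would surface in a test run.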
   
   


