steveloughran opened a new pull request #30319: URL: https://github.com/apache/spark/pull/30319
HADOOP-17332. S3A MarkerTool -min and -max are inverted https://github.com/apache/hadoop/pull/2425

### What changes were proposed in this pull request?

Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes a unique job ID down in `spark.sql.sources.writeJobUUID` (a sketch of this flow appears after the test notes below).

### Why are the changes needed?

Without this, any job started in the same second as another job, when used with a committer that expects application attempt IDs to be unique, is at risk of clashing with that other job. With the fix, those committers which treat the ID set in `spark.sql.sources.writeJobUUID` as the priority ID will pick it up instead and so be unique.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. Hadoop trunk was built with [HADOOP-17318](https://github.com/apache/hadoop/pull/2399), publishing to the local Maven repository.
2. Spark was built with `hadoop.version=3.4.0-SNAPSHOT` to pick up these JARs. For reasons I don't understand, Spark master wouldn't build with that Hadoop version, but an internal Spark 2.x branch was happy.
3. The Spark + object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local Spark version.
4. The tests were executed against AWS London, with `fs.s3a.committer.require.uuid=true` set so the S3A committers fail fast if they don't get a job ID passed down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option; it instead uses the current `Date` value as an app attempt ID, which is not guaranteed to be unique. With the change applied to Spark, the relevant tests pass, therefore the committers are getting unique job IDs.
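To make the mechanism concrete, here is a minimal, illustrative sketch of the idea this PR applies to `SparkHadoopWriter`: generate a UUID per write job and stamp it into the Hadoop `Configuration` under `spark.sql.sources.writeJobUUID` before the write starts. Only the configuration key comes from this PR; the object and method names (`WriteJobUuidSketch`, `stampJobUuid`) are hypothetical, not the actual Spark code.

```scala
import java.util.UUID

import org.apache.hadoop.conf.Configuration

// Hypothetical helper mirroring the idea in this PR: stamp a unique
// job UUID into the Hadoop configuration before the write starts,
// so committers do not have to rely on second-granularity timestamps.
object WriteJobUuidSketch {
  // Configuration key named in this PR and SPARK-33230.
  val WriteJobUuidKey = "spark.sql.sources.writeJobUUID"

  def stampJobUuid(conf: Configuration): String = {
    // Unique per job, unlike Date-based application attempt IDs.
    val jobUuid = UUID.randomUUID().toString
    conf.set(WriteJobUuidKey, jobUuid)
    jobUuid
  }
}
```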
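On the committer side, here is a sketch of the fail-fast behaviour described in the test notes, assuming a committer that prefers the Spark-supplied UUID and honours `fs.s3a.committer.require.uuid`. This is not the actual S3A committer code; `resolveJobUuid` and its fallback logic are illustrative only.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical committer-side lookup: prefer the Spark-supplied UUID,
// fail fast when fs.s3a.committer.require.uuid is set and no UUID was
// passed down, otherwise fall back to the (possibly non-unique,
// second-granularity) application attempt ID.
object CommitterUuidSketch {
  def resolveJobUuid(conf: Configuration, appAttemptId: String): String = {
    val uuid = conf.get("spark.sql.sources.writeJobUUID")
    if (uuid != null && uuid.nonEmpty) {
      uuid
    } else if (conf.getBoolean("fs.s3a.committer.require.uuid", false)) {
      throw new IllegalStateException(
        "No job UUID found in spark.sql.sources.writeJobUUID")
    } else {
      appAttemptId // may clash with other jobs started in the same second
    }
  }
}
```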
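For anyone reproducing the test setup: S3A options can be passed to the Hadoop configuration through Spark's standard `spark.hadoop.` prefix. The snippet below is illustrative of that general mechanism, not a copy of the cloud-integration test harness.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup: make the S3A committers fail fast if no job UUID
// is passed down, matching the test configuration described above.
val spark = SparkSession.builder()
  .appName("committer-uuid-test")
  .config("spark.hadoop.fs.s3a.committer.require.uuid", "true")
  .getOrCreate()
```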
