[GitHub] [spark] steveloughran opened a new pull request #30319: [SPARK-33302][CORE] SparkHadoopWriter to set unique job ID in "spark.sql.sources.writeJobUUID"

GitBox Tue, 10 Nov 2020 08:31:47 -0800


steveloughran opened a new pull request #30319:
URL: https://github.com/apache/spark/pull/30319

HADOOP-17332. S3A MarkerTool -min and -max are inverted
https://github.com/apache/hadoop/pull/2425

### What changes were proposed in this pull request?

Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that
`rdd.saveAsNewAPIHadoopDataset` passes in a unique job ID in
`spark.sql.sources.writeJobUUID`.

### Why are the changes needed?

Without this, if more than one job is started in the same second *and the
committer expects application attempt IDs to be unique*, a job is at risk of
clashing with other jobs.
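
To illustrate the clash, here is a minimal sketch in Python. It is not Spark's actual code: it assumes the job ID is derived from the start time at one-second granularity using a `yyyyMMddHHmmss`-style pattern, and the helper names `timestamp_job_id` and `unique_job_id` are hypothetical.

```python
import uuid
from datetime import datetime


def timestamp_job_id(started: datetime) -> str:
    """Job ID derived from the start time at one-second granularity (assumed format)."""
    return started.strftime("%Y%m%d%H%M%S")


def unique_job_id() -> str:
    """Random UUID, standing in for the value set in spark.sql.sources.writeJobUUID."""
    return str(uuid.uuid4())


# Two jobs started 400 ms apart, within the same second.
job_a_start = datetime(2020, 11, 10, 8, 31, 47, 100_000)
job_b_start = datetime(2020, 11, 10, 8, 31, 47, 500_000)

# The timestamp-derived IDs are identical, so the two jobs can clash.
same_second_clash = timestamp_job_id(job_a_start) == timestamp_job_id(job_b_start)

# UUID-based IDs differ even when generated back to back.
uuids_differ = unique_job_id() != unique_job_id()

print(same_second_clash, uuids_differ)
```

Any ID scheme with at least per-job randomness would avoid the clash; a UUID is used here because that is what the option name suggests.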

With the fix, those committers which give priority to the ID set in
`spark.sql.sources.writeJobUUID` will pick that up instead, and so their job
IDs will be unique.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. Hadoop-trunk built with
[HADOOP-17318](https://github.com/apache/hadoop/pull/2399), publishing to the
local Maven repo.
1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs. For
reasons I don't understand, Spark master wouldn't build with that Hadoop
version, but an internal Spark 2.x branch was happy.
1. Spark + Object store integration tests at
[https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration)
were built against that local Spark version.
1. The tests were executed against AWS London.

The tests were run with `fs.s3a.committer.require.uuid=true`, so the S3A
committers fail fast if they aren't passed a job ID. This showed that
`rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It still uses
the current Date value for an app attempt, which is not guaranteed to be unique.
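
The behaviour the tests rely on can be sketched roughly as follows. This is an illustration in Python, not the S3A committer code: the function `resolve_job_id`, the use of a plain dict for the configuration, and the exception type are all assumptions; only the two option names come from this PR.

```python
WRITE_JOB_UUID = "spark.sql.sources.writeJobUUID"
REQUIRE_UUID = "fs.s3a.committer.require.uuid"


def resolve_job_id(conf: dict, fallback_attempt_id: str) -> str:
    """Pick the job ID a committer would use (hypothetical helper).

    Prefers the UUID passed down by Spark; if it is missing, either fails
    fast (when the UUID is required) or falls back to the attempt ID.
    """
    job_uuid = conf.get(WRITE_JOB_UUID, "")
    if job_uuid:
        return job_uuid
    if conf.get(REQUIRE_UUID, "false").lower() == "true":
        raise ValueError(f"{WRITE_JOB_UUID} is not set but {REQUIRE_UUID}=true")
    # Fallback: an ID that is not guaranteed to be unique.
    return fallback_attempt_id


# Before the fix: no UUID is passed down, so a strict committer fails fast.
try:
    resolve_job_id({REQUIRE_UUID: "true"}, "20201110083147")
    failed_fast = False
except ValueError:
    failed_fast = True

# After the fix: the UUID is set and takes priority over the fallback.
picked = resolve_job_id(
    {REQUIRE_UUID: "true", WRITE_JOB_UUID: "example-uuid"}, "20201110083147"
)
print(failed_fast, picked)
```

Failing fast turns a silent risk of clashing job IDs into an immediate test failure, which is how the missing option was detected.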

With the change applied to Spark, the relevant tests work, therefore the
committers are getting unique job IDs.

