GitHub user rezasafi opened a pull request: https://github.com/apache/spark/pull/19848
[SPARK-22162] Executors and the driver should use consistent JobIDs in the RDD commit protocol

I have modified SparkHadoopWriter so that executors and the driver always use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark always used the rddId; it just incorrectly named the variable stageId. After SPARK-18191, it used the rddId on the driver and the stageId on the executors.

With this change, FileCommitProtocol now has a commitTask method that receives the rddId as the JobId in addition to the stageId. During the Hadoop commit protocol, the jobId is then used by Hadoop, while Spark can still use the stageId as before. This way executors and the driver consistently use the stageId.

In addition to the existing unit tests, a test has been added to check whether executors and the driver use the same JobId. The test failed before this change and passes after applying this fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rezasafi/spark stagerddsimple

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19848.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19848

----

commit 4dbdbe77435630e3b35581c59189ec75c9c2484d
Author: Reza Safi <rezas...@cloudera.com>
Date:   2017-11-28T23:03:37Z

    [SPARK-22162] Executors and the driver should use consistent JobIDs in the RDD commit protocol

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
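To make the idea concrete, here is a minimal, hypothetical Scala sketch of the invariant the PR enforces. It is not the actual Spark or Hadoop API; the names hadoopJobId and commitTask are illustrative stand-ins. The point is that both the driver and the executors derive the Hadoop JobID from the same rddId, while Spark keeps using the stageId internally.

```scala
// Hypothetical sketch of the consistent-JobID invariant (NOT the real
// SparkHadoopWriter / FileCommitProtocol code).
object CommitProtocolSketch {

  // Hadoop-style job id string derived from a tracker timestamp and an
  // integer id. The fix makes that integer the rddId on BOTH the driver
  // and the executors, instead of rddId on one side and stageId on the other.
  def hadoopJobId(jobTrackerId: String, rddId: Int): String =
    s"job_${jobTrackerId}_$rddId"

  // Illustrative commitTask: it receives the rddId (used as the Hadoop
  // jobId) in addition to the stageId, which Spark still uses internally.
  def commitTask(jobTrackerId: String, rddId: Int,
                 stageId: Int, taskId: Int): String = {
    val jobId = hadoopJobId(jobTrackerId, rddId) // Hadoop sees the rddId
    s"$jobId / stage $stageId / task $taskId"    // Spark keeps the stageId
  }

  def main(args: Array[String]): Unit = {
    // Driver and executor both compute the JobID from the same rddId,
    // so the Hadoop commit protocol sees one consistent job identity.
    val driverJobId   = hadoopJobId("20171128", rddId = 42)
    val executorJobId = hadoopJobId("20171128", rddId = 42)
    assert(driverJobId == executorJobId)
    println(commitTask("20171128", rddId = 42, stageId = 7, taskId = 3))
  }
}
```

The added unit test mentioned above checks essentially this property: before the fix, the driver-side and executor-side JobIds diverged (rddId vs. stageId), which the assert-style comparison here would catch.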