GitHub user rezasafi opened a pull request:
https://github.com/apache/spark/pull/19848
[SPARK-22162] Executors and the driver should use consistent JobIDs in the
RDD commit protocol
I have modified SparkHadoopWriter so that executors and the driver always
use consistent JobIDs during the Hadoop commit. Before SPARK-18191, Spark
always used the rddId; the variable was just incorrectly named stageId. After
SPARK-18191, the driver used the rddId while the executors used the stageId.
With this change, FileCommitProtocol now has a commitTask method that
receives the rddId as the JobID in addition to the stageId. During the Hadoop
commit protocol, Hadoop uses that jobId while Spark can still use the
stageId as before. This way executors and the driver consistently use the
same JobID.
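The failure mode the patch addresses can be illustrated with a small sketch (hypothetical paths and function names, not actual Spark or Hadoop code): a Hadoop-style committer stages task output under a job-scoped temporary directory, so if executors commit tasks under one job ID while the driver finalizes the job under another, the job commit finds nothing to promote.

```python
# Hypothetical sketch (not Spark/Hadoop source): why the driver and the
# executors must agree on the job ID during a Hadoop-style commit.

def task_commit_path(job_id: int, task_id: int) -> str:
    # Tasks stage their output under the job's temporary directory.
    return f"_temporary/{job_id}/task_{task_id}"

def commit_job(job_id: int, staged_paths: list) -> list:
    # Job commit promotes only the output staged under *its* job ID.
    prefix = f"_temporary/{job_id}/"
    return [p for p in staged_paths if p.startswith(prefix)]

# Consistent IDs: the driver finds and promotes all task output.
rdd_id = 7
staged = [task_commit_path(rdd_id, t) for t in range(3)]
assert len(commit_job(rdd_id, staged)) == 3

# Inconsistent IDs (executors use the stageId, driver uses the rddId):
# nothing matches, and the committed output is silently empty.
stage_id = 12
staged = [task_commit_path(stage_id, t) for t in range(3)]
assert len(commit_job(rdd_id, staged)) == 0
```

The directory layout above mirrors the general shape of Hadoop's per-job temporary staging, but the exact paths and helpers are invented for illustration only.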
In addition to the existing unit tests, a test has been added to check
that executors and the driver use the same JobID. The test failed
before this change and passes with this fix applied.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rezasafi/spark stagerddsimple
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19848.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19848
----
commit 4dbdbe77435630e3b35581c59189ec75c9c2484d
Author: Reza Safi <[email protected]>
Date: 2017-11-28T23:03:37Z
[SPARK-22162] Executors and the driver should use consistent JobIDs in the
RDD commit protocol
----