GitHub user rezasafi opened a pull request:
https://github.com/apache/spark/pull/19886
[SPARK-22162][BRANCH-2.2] Executors and the driver should use consistent
JobIDs in the RDD commit protocol
I have modified SparkHadoopMapReduceWriter so that the executors and the driver
always use consistent JobIds during the Hadoop commit. Before SPARK-18191,
Spark always used the rddId; it just incorrectly named the variable stageId.
After SPARK-18191, it used the rddId as the jobId on the driver's side and the
stageId as the jobId on the executors' side. With this change, the executors and
the driver consistently use the rddId as the jobId. Also with this change, during
the Hadoop commit protocol Spark uses the actual stageId to check whether a stage
can be committed, whereas before it used the executors' jobId for this check.
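For illustration, here is a minimal sketch of the invariant this patch enforces
(the rddId value and the way the job tracker string is built are assumptions for
illustration, not the exact code in the patch): both the driver-side job setup and
the executor-side task setup derive the Hadoop JobID from the same rddId, so the
commit protocol sees one consistent job.

    // Minimal sketch, not the actual patch: both sides build the Hadoop JobID
    // from the same rddId instead of the executor deriving it from the stageId.
    import java.text.SimpleDateFormat
    import java.util.{Date, Locale}
    import org.apache.hadoop.mapreduce.JobID

    val rddId = 42  // assumed RDD id, for illustration only
    val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())

    // Driver side: job setup/commit uses the rddId-based JobID.
    val driverJobId = new JobID(jobTrackerId, rddId)
    // Executor side: each task's attempt context must use the same rddId
    // (before this fix the stageId was used here, producing a different JobID).
    val executorJobId = new JobID(jobTrackerId, rddId)

    assert(driverJobId == executorJobId)  // consistent IDs across driver and executors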
In addition to the existing unit tests, a test has been added to check
whether the executors and the driver use the same JobId. The test failed
before this change and passed after applying this fix.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rezasafi/spark stagerdd22
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19886.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19886
----
commit b0f4b495525010b2608148f992f3cf18c231411f
Author: Reza Safi <[email protected]>
Date: 2017-12-04T23:56:27Z
[SPARK-22162][BRANCH-2.2] Executors and the driver should use consistent
JobIDs in the RDD commit protocol
I have modified SparkHadoopWriter so that the executors and the driver always
use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark
always used the rddId; it just incorrectly named the variable stageId. After
SPARK-18191, it used the rddId as the jobId on the driver's side and the
stageId as the jobId on the executors' side. With this change, the executors and
the driver consistently use the rddId as the jobId. Also with this change, during
the Hadoop commit protocol Spark uses the actual stageId to check whether a stage
can be committed, whereas before it used the executors' jobId for this check.
In addition to the existing unit tests, a test has been added to check
whether the executors and the driver use the same JobId. The test failed
before this change and passed after applying this fix.
----