GitHub user rezasafi opened a pull request:

    https://github.com/apache/spark/pull/19886

    [SPARK-22162][BRANCH-2.2] Executors and the driver should use consistent 
JobIDs in the RDD commit protocol

    I have modified SparkHadoopMapReduceWriter so that executors and the driver 
always use consistent JobIds during the Hadoop commit. Before SPARK-18191, 
Spark always used the rddId; it just incorrectly named the variable stageId. 
After SPARK-18191, it used the rddId as the jobId on the driver's side and the 
stageId as the jobId on the executors' side. With this change, executors and 
the driver consistently use the rddId as the jobId. Also with this change, 
during the Hadoop commit protocol Spark uses the actual stageId to check 
whether a stage can be committed, unlike before, when it used the executors' 
jobId for this check.
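    A minimal, self-contained Scala sketch of the invariant this patch 
enforces (illustrative only, not the actual SparkHadoopMapReduceWriter code; 
names such as ConsistentJobIdSketch and makeJobId are made up for this 
example): both the driver and the executors derive the Hadoop JobID from the 
rddId, while the commit-permission check is keyed on the real stageId.

    import java.text.SimpleDateFormat
    import java.util.{Date, Locale}

    object ConsistentJobIdSketch {
      // Models the shape of a Hadoop JobID string: trackerId plus a numeric id.
      def makeJobId(jobTrackerId: String, id: Int): String =
        "job_" + jobTrackerId + "_" + "%04d".format(id)

      def main(args: Array[String]): Unit = {
        val jobTrackerId =
          new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())

        val rddId = 42   // the RDD being written
        val stageId = 7  // the stage actually running the write tasks

        // Driver side: commit setup keyed by the rddId.
        val driverJobId = makeJobId(jobTrackerId, rddId)

        // Executor side: before the fix this was built from the stageId; after
        // the fix the driver's rddId is propagated, so both sides agree.
        val executorJobId = makeJobId(jobTrackerId, rddId)

        assert(driverJobId == executorJobId,
          "driver and executors must share a consistent JobID")

        // The can-commit check, by contrast, uses the actual stageId.
        println(s"JobID: $driverJobId; canCommit keyed on stageId=$stageId")
      }
    }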
    In addition to the existing unit tests, a test has been added to check 
whether executors and the driver use the same JobId. The test failed before 
this change and passes after applying this fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rezasafi/spark stagerdd22

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19886
    
----
commit b0f4b495525010b2608148f992f3cf18c231411f
Author: Reza Safi <[email protected]>
Date:   2017-12-04T23:56:27Z

    [SPARK-22162][BRANCH-2.2] Executors and the driver should use consistent 
JobIDs in the RDD commit protocol
    I have modified SparkHadoopWriter so that executors and the driver always 
use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark 
always used the rddId; it just incorrectly named the variable stageId. After 
SPARK-18191, it used the rddId as the jobId on the driver's side and the 
stageId as the jobId on the executors' side. With this change, executors and 
the driver consistently use the rddId as the jobId. Also with this change, 
during the Hadoop commit protocol Spark uses the actual stageId to check 
whether a stage can be committed, unlike before, when it used the executors' 
jobId for this check.
    In addition to the existing unit tests, a test has been added to check 
whether executors and the driver use the same JobId. The test failed before 
this change and passes after applying this fix.

----


---
