GitHub user CodingCat opened a pull request:

    https://github.com/apache/spark/pull/16542

    [SPARK-18905] Fix the issue of removing a failed jobset from 
JobScheduler.jobSets

    ## What changes were proposed in this pull request?
    
    the current implementation of Spark streaming considers a batch is 
completed no matter the results of the jobs 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
    Let's consider the following case:
    A micro batch contains 2 jobs and they read from two different kafka topics 
respectively. One of these jobs is failed due to some problem in the user 
defined logic, after the other one is finished successfully.
    1. The main thread in the Spark streaming application will execute the line 
mentioned above,
    2. and another thread (checkpoint writer) will make a checkpoint file 
immediately after this line is executed.
    3. Then due to the current error handling mechanism in Spark Streaming, 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)
    the user recovers from the checkpoint file, and because the JobSet 
containing the failed job has been removed (taken as completed) before the 
checkpoint is constructed, the data being processed by the failed job would 
never be reprocessed
    
    This PR fix it by removing jobset from JobScheduler.jobSets only when all 
jobs in a jobset are successfully finished
    
    ## How was this patch tested?
    
    existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark SPARK-18905

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16542.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16542
    
----
commit 24bfa382a43c2fbbf54b24bb8f03766910216490
Author: CodingCat <[email protected]>
Date:   2016-03-07T14:37:37Z

    improve the doc for "spark.memory.offHeap.size"

commit 2209e345df4636f8fa881b3ad45084b75f9fe3eb
Author: CodingCat <[email protected]>
Date:   2016-03-07T19:00:16Z

    fix

commit 65623f4408ab6152719046c55093c70435da82c8
Author: Nan Zhu <[email protected]>
Date:   2017-01-11T00:04:55Z

    Merge branch 'master' of https://github.com/apache/spark

commit a8646acf826cfbbabdf0d20129e62a14be404c3b
Author: Nan Zhu <[email protected]>
Date:   2017-01-11T00:24:02Z

    do not remove a jobset with any failed job from jobset to prevent data loss

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to