GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/5964

    [SPARK-7308][WIP] prevent concurrent attempts for one stage

    Reproduction of multiple concurrent stage attempts, and a fix.  This 
actually doesn't completely solve the problem, but it is a vast improvement.  I 
wanted to put this up so others could take a look at the test case, since the 
DAGScheduler is tricky (the purpose of the PR at this point is mostly just the 
failure reproduction).
    
    I'd recommend reviewers check out a branch which has *just* the failure 
reproduction, without the fix here: 
https://github.com/squito/spark/tree/SPARK-7308_failure_reproduction.  Run 
[this 
test](https://github.com/squito/spark/blob/SPARK-7308_failure_reproduction/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala#L777)
 -- even just watching the logs with `tail -f core/target/unit-tests.log | grep 
DAGScheduler` shows some really weird behavior: Stage 2 has multiple 
concurrent attempts, which appear to stomp all over each other; Stage 3 gets 
submitted before Stage 2 ever finishes, and then rapidly fires off a 
bunch of attempts which all quickly die (I've seen > 50 attempts); and 
executors keep getting lost, even though the test case only simulates one 
executor getting lost.
    
    Even with the fix here as it stands, we still see concurrent attempts for 
stage 3, and the first several die from corrupted streams.  However, those 
failures are recovered from consistently -- the job completes successfully.  But 
that could just be because the test case isn't complicated enough, so that 
definitely needs to be fixed as well.  I have a rough idea of how it's happening 
and think I can fix it.
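
    The direction the commits below take ("tasks know which stageAttempt they 
belong to", "ignore fetch failure from attempts that are already failed") can be 
sketched in the same toy model -- again with hypothetical names, not the actual 
DAGScheduler change: only act on a fetch failure if it came from the stage's 
latest attempt.

```scala
import scala.collection.mutable

// Same toy model, with the guard the commits below point at ("tasks know which
// stageAttempt they belong to" / "ignore fetch failure from attempts that are
// already failed").  A hypothetical sketch, not the actual DAGScheduler change.
object GuardedAttemptSketch {

  case class ToyFetchFailed(stageId: Int, fromAttempt: Int)

  class GuardedScheduler {
    private val latestAttempt = mutable.Map.empty[Int, Int]  // stageId -> newest attempt

    def submitStage(stageId: Int): Unit = {
      val attempt = latestAttempt.getOrElse(stageId, -1) + 1
      latestAttempt(stageId) = attempt
      println(s"starting stage $stageId attempt $attempt")
    }

    def handleFetchFailed(f: ToyFetchFailed): Unit = {
      if (latestAttempt.get(f.stageId).contains(f.fromAttempt)) {
        // the failure came from the stage's current attempt: resubmit as usual
        submitStage(f.stageId)
      } else {
        // stale failure from a superseded attempt: drop it so it cannot spawn
        // yet another concurrent attempt of the same stage
        println(s"ignoring stale fetch failure from stage ${f.stageId} attempt ${f.fromAttempt}")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val sched = new GuardedScheduler
    sched.submitStage(2)                          // starts attempt 0
    sched.handleFetchFailed(ToyFetchFailed(2, 0)) // from the current attempt -> resubmit as attempt 1
    sched.handleFetchFailed(ToyFetchFailed(2, 0)) // attempt 0 is now stale -> ignored
    sched.handleFetchFailed(ToyFetchFailed(2, 0)) // ignored
  }
}
```

    With that guard, stale failures are dropped, so in the toy model a single 
lost executor causes at most one resubmission per stage.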

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-7308_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5964.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5964
    
----
commit d08c20cd1fbb22bb5db191db3d4616e5ed8b6f52
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T00:49:27Z

    tasks know which stageAttempt they belong to

commit 89e8428db2441258597e3962905da6317912cc12
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T03:54:57Z

    reproduce the failure

commit 70a787be6e55605365d84490e0d2072d4c7f5143
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T04:13:23Z

    ignore fetch failure from attempts that are already failed.  only a partial 
fix, still have some concurrent attempts

commit 7fbcefbdb466daca0f492966492e4d7710247810
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T04:15:11Z

    ignore the test for now just to avoid swamping jenkins

----

