GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/5964
[SPARK-7308][WIP] prevent concurrent attempts for one stage
Reproduction of multiple concurrent stage attempts, and a fix. The fix
doesn't completely solve the problem, but it is a vast improvement. I
wanted to put this up so others could take a look at the test case, since the
DAGScheduler is tricky (the purpose of the PR at this point is mostly just the
failure reproduction).
I'd recommend reviewers check out a branch which *just* has the failure
reproduction, without the fix here:
https://github.com/squito/spark/tree/SPARK-7308_failure_reproduction. Run
[this
test](https://github.com/squito/spark/blob/SPARK-7308_failure_reproduction/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala#L777)
-- even just watch the logs with `tail -f core/target/unit-tests.log | grep
DAGScheduler` and you will see some really weird behavior: Stage 2 has multiple
concurrent attempts, which appear to stomp all over each other; Stage 3 gets
submitted before Stage 2 ever finishes, and then it rapidly fires off a
bunch of attempts which all quickly die (I've seen > 50 attempts); and lots of
executors continue to get lost, though the test case only simulates one
executor getting lost.
Even with the fix here atm, we still see concurrent attempts for stage 3,
and the first several die from corrupted streams. However, these failures are
recovered from consistently -- the job completes successfully. But, that could
be just b/c the test case isn't complicated enough, so that definitely needs to
be fixed as well. I have a vague idea of how it's happening and think I can fix
it.
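
For readers who have not looked at the diff, here is a rough sketch of the idea behind two of the commits ("tasks know which stageAttempt they belong to" and "ignore fetch failure from attempts that are already failed"). It is not the code in this PR and not the DAGScheduler's real API: `TaskInfo`, `StageState` and `onFetchFailed` are hypothetical names in a deliberately simplified model, assuming each task carries the attempt id it was launched under and each fetch failure is reported once per task.

```scala
// Hypothetical, simplified model. These are NOT Spark's real classes.

// Each task records the stage attempt it was launched under.
final case class TaskInfo(stageId: Int, stageAttemptId: Int, partition: Int)

final class StageState(val stageId: Int) {
  private var latestAttempt: Int = 0
  private var failedAttempts: Set[Int] = Set.empty

  def currentAttempt: Int = latestAttempt

  /** Returns true if this fetch failure should trigger a new stage attempt,
    * false if it comes from an attempt already marked as failed. */
  def onFetchFailed(task: TaskInfo): Boolean = {
    require(task.stageId == stageId, "task belongs to a different stage")
    if (failedAttempts.contains(task.stageAttemptId)) {
      // A straggler from an attempt we already gave up on: ignore it,
      // otherwise every late task would kick off yet another attempt.
      false
    } else {
      failedAttempts += task.stageAttemptId
      latestAttempt += 1
      true
    }
  }
}
```

In this model, the first fetch failure from attempt 0 of a stage returns `true` and moves the stage to attempt 1; a second fetch failure from another task of attempt 0 returns `false` and leaves the attempt counter alone, so one lost executor cannot cause a burst of resubmissions on its own.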
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark SPARK-7308_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5964.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5964
----
commit d08c20cd1fbb22bb5db191db3d4616e5ed8b6f52
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T00:49:27Z
tasks know which stageAttempt they belong to
commit 89e8428db2441258597e3962905da6317912cc12
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T03:54:57Z
reproduce the failure
commit 70a787be6e55605365d84490e0d2072d4c7f5143
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T04:13:23Z
ignore fetch failure from attempts that are already failed. only a partial
fix, still have some concurrent attempts
commit 7fbcefbdb466daca0f492966492e4d7710247810
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T04:15:11Z
ignore the test for now just to avoid swamping jenkins
----