GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/8007
SPARK-8167 Make tasks that fail from YARN preemption not fail job
The architecture is that, in YARN mode, if the driver detects that an
executor has disconnected, it asks the ApplicationMaster why the executor died.
If the ApplicationMaster is aware that the executor died because of preemption,
all tasks associated with that executor are not marked as failed. The executor
is still removed from the driver's list of available executors, however.
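For reviewers, the flow described above can be sketched roughly as follows. This is a simplified model, not the code in this patch: `LossReason`, `LossReasonSource`, and `DriverSketch` are hypothetical names, and the real driver and ApplicationMaster communicate over RPC rather than a direct method call.

```scala
// Hypothetical, simplified model of the described flow; not Spark's actual classes.
sealed trait LossReason
case object Preempted extends LossReason
final case class ExitedWithError(message: String) extends LossReason

// Stand-in for the ApplicationMaster, which knows why an executor's container completed.
trait LossReasonSource {
  def getExecutorLossReason(executorId: String): LossReason
}

final class DriverSketch(am: LossReasonSource, initialTasks: Map[String, Set[Int]]) {
  // Tasks currently assigned to each available executor.
  private var tasksByExecutor: Map[String, Set[Int]] = initialTasks
  // Tasks that have been marked as failed.
  private var failed: Set[Int] = Set.empty

  def onDisconnected(executorId: String): Unit = {
    // Events for executors that were already removed are ignored.
    tasksByExecutor.get(executorId).foreach { tasks =>
      am.getExecutorLossReason(executorId) match {
        case Preempted          => () // preemption: tasks are not marked as failed
        case ExitedWithError(_) => failed ++= tasks
      }
      // The executor is removed from the available list in either case.
      tasksByExecutor -= executorId
    }
  }

  def availableExecutors: Set[String] = tasksByExecutor.keySet
  def failedTasks: Set[Int] = failed
}
```

In this sketch, only the failure bookkeeping depends on the loss reason; executor removal is unconditional.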
There are a few open questions:
1. Should standalone mode have a similar "get executor loss reason" mechanism as well?
2. I make a pretty strong assumption in YarnAllocator that
getExecutorLossReason(executorId) will only be called once per executor id; I
do this so that I can remove the metadata from the in-memory map to avoid
object accumulation. It's not clear if I'm being overly zealous to save space,
however.
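To make the second question concrete, here is a minimal sketch of the read-once bookkeeping it describes. `LossReasonRegistry` and `recordCompletion` are hypothetical names used only for illustration, and the reason is modeled as a plain `String`; the actual YarnAllocator code in this patch may differ.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the in-memory map described above; not the actual YarnAllocator.
final class LossReasonRegistry {
  private val reasons = mutable.HashMap.empty[String, String]

  // Called when a container completes, recording why its executor went away.
  def recordCompletion(executorId: String, reason: String): Unit = synchronized {
    reasons(executorId) = reason
  }

  // Returns the recorded reason and removes it, so entries do not accumulate.
  // A second call for the same executor id finds nothing.
  def getExecutorLossReason(executorId: String): Option[String] = synchronized {
    reasons.remove(executorId)
  }

  def pendingCount: Int = synchronized { reasons.size }
}
```

The trade-off is visible in the second lookup: once an entry has been consumed, a repeated query cannot be told apart from a query for an executor that was never recorded.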
cc @vanzin specifically for review because it collided with some earlier
YARN scheduling work.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/palantir/spark feature/preemption-handling
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8007.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8007
----
commit 63cd958b4439aaf28581bdde3762e2413577a4bc
Author: mcheah <[email protected]>
Date: 2015-07-07T18:47:01Z
[SPARK-8167] Make tasks that fail from YARN preemption not fail job.
The architecture is that, in YARN mode, if the driver detects that an
executor has disconnected, it asks the ApplicationMaster why the executor died.
If the ApplicationMaster is aware that the executor died because of preemption,
all tasks associated with that executor are not marked as failed. The executor
is still removed from the driver's list of available executors, however.
commit 1100ed8d1cb3f504729580af672bf6d6b3eeb624
Author: mcheah <[email protected]>
Date: 2015-07-08T18:34:55Z
Handle multiple onDisconnected events since executor removal doesn't happen
immediately.
Also begin unit tests.
commit 08d2bdd058856b6b1ecb514a8e1066bfacf7518c
Author: mcheah <[email protected]>
Date: 2015-07-09T03:52:49Z
Completing unit tests.
commit 62e43428a3a89dda395bd6c8d47066ae460165a3
Author: mcheah <[email protected]>
Date: 2015-07-09T18:07:25Z
Removing rogue .orig file
commit bd1056ec2554158e1a7eb3509732d89334552fe8
Author: mcheah <[email protected]>
Date: 2015-07-09T18:17:52Z
Put back extends CoarseGrainedClusterMessage
commit 2bbdcfd94c66e82b825977676c76d1cb05c1d9a9
Author: mcheah <[email protected]>
Date: 2015-07-09T18:19:36Z
Removing unnecessary import
commit a697c7cc38ac417eb6a7fed658acd4782824d8fa
Author: mcheah <[email protected]>
Date: 2015-07-09T18:21:28Z
Organizing more imports
commit 3f1104f4639e319d5db2f2c77810308f6f1f3eba
Author: mcheah <[email protected]>
Date: 2015-08-04T19:11:50Z
Merge branch 'preemption-handling-base' into preemption-handling
Conflicts:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala
commit e791cd6378d052ce65d8835d43790b75b8c7c0cd
Author: mcheah <[email protected]>
Date: 2015-08-04T20:51:56Z
Fixing a compiler error caused by the merge from master.
Also I changed the semantics of YarnAllocator.getExecutorLossReason() to
better "clean up" the completed executor map. Also
processCompletedContainers() now always adds an executor reason to the
completed executor exit reason map regardless of exit status as it is
expected for the client to always call getExecutorLossReason().
commit aa69b6fc2fb2db7c98a81bebdf02f23fd18d2b52
Author: mcheah <[email protected]>
Date: 2015-08-06T21:32:45Z
Adding a stronger assertion, and fixing compiler error
----