Github user vanzin commented on the pull request:
https://github.com/apache/spark/pull/7431#issuecomment-122062974
So now that I tried the new code path (which works), I'm a little skeptical
that sending a message back to the driver is really needed. The driver already
removes the executor when the RPC connection is reset:
```
15/07/16 12:30:15 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 4, vanzin-st1-3.vpc.cloudera.com): ExecutorLostFailure (executor 3 lost)
15/07/16 12:30:15 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:36279] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/07/16 12:30:15 INFO DAGScheduler: Executor lost: 3 (epoch 0)
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
15/07/16 12:30:15 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, vanzin-st1-3.vpc.cloudera.com, 37469)
15/07/16 12:30:15 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
```
The new message ends up being a no-op:
```
15/07/16 12:30:18 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 3
```
See `CoarseGrainedSchedulerBackend::DriverEndpoint::removeExecutor`.
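To make the no-op concrete, here's a toy standalone model of the two removal paths; the object and field names only echo the real class, this is not the actual Spark source:

```scala
import scala.collection.mutable

// Hypothetical model, NOT Spark source: the RPC-disconnect path and the
// explicit message race to remove the same executor from the driver's map.
object RemoveExecutorNoOp {
  private val executorDataMap = mutable.Map[String, String]()

  // Path 1: the RPC layer sees the connection reset and removes the executor.
  def onDisconnected(executorId: String): Unit =
    removeExecutor(executorId, reason = "remote RPC client disassociated")

  // Shared removal logic: a second call for the same executor finds nothing,
  // which is exactly the "Asked to remove non-existent executor" log above.
  def removeExecutor(executorId: String, reason: String): Unit =
    executorDataMap.remove(executorId) match {
      case Some(_) => println(s"Removed executor $executorId ($reason)")
      case None    => println(s"Asked to remove non-existent executor $executorId")
    }

  def main(args: Array[String]): Unit = {
    executorDataMap("3") = "vanzin-st1-3.vpc.cloudera.com:37469"
    onDisconnected("3")                               // connection reset wins the race
    removeExecutor("3", "explicit shutdown message")  // arrives later: no-op
  }
}
```

Whichever path runs first wins; the later one just logs the error above.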
So I'm a little confused about how this change is fixing anything. The bug
talks about "repeated re-execution of stages" - isn't that the correct way of
handling executor failures? You retry tasks or stages depending on what the
failure is.
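Roughly the distinction I mean, as a toy sketch; the case names and the limit of 4 (the default of `spark.task.maxFailures`) are simplifications, not a real API:

```scala
sealed trait Failure
case object TaskCrash   extends Failure // e.g. the task's executor died
case object FetchFailed extends Failure // shuffle output lived on a dead executor

// Retry individual tasks up to a limit; a fetch failure instead triggers
// re-execution of the stage that produced the missing shuffle output.
def handleFailure(f: Failure, attempts: Int, maxAttempts: Int = 4): String = f match {
  case TaskCrash if attempts < maxAttempts => "re-queue the task on another executor"
  case TaskCrash                           => "abort the stage after too many failures"
  case FetchFailed                         => "resubmit the parent map stage, then retry"
}
```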
Perhaps the real issue you ran into is something like #6750 instead?