Mark Hamstra created SPARK-1685:
-----------------------------------
Summary: retryTimer not canceled on actor restart in Worker and AppClient
Key: SPARK-1685
URL: https://issues.apache.org/jira/browse/SPARK-1685
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra

Both deploy.worker.Worker and deploy.client.AppClient call registerWithMaster
when those actors start. Registration works by starting a retryTimer on the
Akka scheduler. The timer uses the configured timeout interval and retry count
to make repeated attempts to register with all known Masters before giving up
and either marking itself as dead or calling System.exit.

The receive methods of these actors can, however, throw exceptions. An
exception causes the actor to restart, registerWithMaster is called again on
restart, and another retryTimer is scheduled without canceling the one that is
already running. Assuming that the rest of the restart logic is correct for
these actors (which I don't believe is actually a given), having multiple
retryTimers running creates at least one problem: the restarted actor cannot
make its full number of retry attempts before an earlier retryTimer takes the
"give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice.
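
The postStop approach works because, in the classic Akka actor API, the
default preRestart hook calls postStop on the failing instance, and the default
postRestart hook calls preStart on the new one. The following is a minimal
sketch of that pattern, not the actual Worker or AppClient code. The actor
name RegisteringActor, the message RetryRegistration, the helper
cancelRetryTimer, the empty tryRegisterAllMasters placeholder, and the
interval and retry values are all assumptions made for illustration, and
context.stop(self) merely stands in for the real "give up" action.

```scala
import akka.actor.{Actor, Cancellable}
import scala.concurrent.duration._

// Hypothetical tick message; not one of Spark's deploy messages.
case object RetryRegistration

// Hypothetical actor standing in for Worker or AppClient.
class RegisteringActor extends Actor {
  import context.dispatcher // implicit ExecutionContext required by the scheduler

  // Assumed values, for illustration only.
  private val retryInterval = 20.seconds
  private val maxRetries = 3

  private var retries = 0
  private var retryTimer: Option[Cancellable] = None

  // Runs on first start and, through the default postRestart, after a restart.
  override def preStart(): Unit = registerWithMaster()

  // Runs on stop and, through the default preRestart, before a restart,
  // so a failing instance does not leave its timer running.
  override def postStop(): Unit = cancelRetryTimer()

  private def cancelRetryTimer(): Unit = {
    retryTimer.foreach(_.cancel())
    retryTimer = None
  }

  private def registerWithMaster(): Unit = {
    tryRegisterAllMasters()
    retries = 0
    cancelRetryTimer() // never keep two timers alive in one instance
    retryTimer = Some(
      context.system.scheduler.schedule(
        retryInterval, retryInterval, self, RetryRegistration))
  }

  // Placeholder: a real actor would message every known Master here.
  private def tryRegisterAllMasters(): Unit = ()

  def receive: Receive = {
    case RetryRegistration =>
      retries += 1
      if (retries >= maxRetries) {
        cancelRetryTimer()
        context.stop(self) // stand-in for the "give up" action
      } else {
        tryRegisterAllMasters()
      }
  }
}
```

Without the postStop override, each restart would add a timer while the
previous one kept firing, which is the condition described above. Note that
canceling a timer does not remove a tick that is already sitting in the
mailbox, so a restarted actor may still see one stale RetryRegistration.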