[ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1685.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.2
                   1.0.0

> retryTimer not canceled on actor restart in Worker and AppClient
> ----------------------------------------------------------------
>
>                 Key: SPARK-1685
>                 URL: https://issues.apache.org/jira/browse/SPARK-1685
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0, 1.0.0, 0.9.1
>            Reporter: Mark Hamstra
>            Assignee: Mark Hamstra
>             Fix For: 1.0.0, 0.9.2
>
>
> Both deploy.worker.Worker and deploy.client.AppClient call
> registerWithMaster when those actors start. Registration works by starting
> a retryTimer via the Akka scheduler. The timer uses the registration timeout
> interval and retry count to make repeated attempts to register with all
> known Masters before giving up and either marking as dead or calling
> System.exit.
>
> The receive methods of these actors can, however, throw exceptions. An
> exception leads to the actor restarting, registerWithMaster being called
> again on restart, and another retryTimer being scheduled without the
> already running retryTimer being canceled. Assuming that all of the rest of
> the restart logic is correct for these actors (which I don't believe is
> actually a given), having multiple retryTimers running means, at the very
> least, that the restarted actor may not be able to make the full number of
> retry attempts before an earlier retryTimer takes the "give up" action.
>
> Canceling the retryTimer in the actor's postStop hook should suffice.
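
The fix proposed in the description can be sketched roughly as follows. This
is a minimal illustration, not the actual Spark patch: RegisteringClient,
onAttempt, intervalMs and timerActive are hypothetical names, and a plain JDK
ScheduledExecutorService stands in for the Akka scheduler so that the sketch
runs without Akka on the classpath. The preStart and postStop methods only
mimic the actor lifecycle hooks of the same names.

```scala
import java.util.concurrent.{ScheduledExecutorService, ScheduledFuture, TimeUnit}

// Hypothetical stand-in for Worker/AppClient; NOT the real Spark classes.
// A JDK scheduler is assumed here in place of the Akka scheduler.
class RegisteringClient(scheduler: ScheduledExecutorService,
                        intervalMs: Long,
                        onAttempt: () => Unit) {

  // Handle to the periodic retry task (plays the role of retryTimer).
  private var retryTimer: Option[ScheduledFuture[_]] = None

  // Mimics the actor's preStart hook: begin registration attempts.
  def preStart(): Unit = registerWithMaster()

  // Mimics the actor's postStop hook: cancel the timer so that it cannot
  // outlive this incarnation and fire alongside a newer timer.
  def postStop(): Unit = retryTimer.foreach(_.cancel(false))

  // True while this incarnation still has a live, uncancelled timer.
  def timerActive: Boolean = retryTimer.exists(t => !t.isCancelled)

  private def registerWithMaster(): Unit = {
    val task = new Runnable { def run(): Unit = onAttempt() }
    retryTimer = Some(
      scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS))
  }
}
```

A restart can be simulated by calling postStop() on the old instance and
preStart() on a fresh one, which matches the order Akka's default
preRestart/postRestart hooks use. After that sequence only the new instance
holds an active timer, so an earlier timer can no longer take the "give up"
action on behalf of the restarted actor.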



--
This message was sent by Atlassian JIRA
(v6.2#6252)
