[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

P. Taylor Goetz resolved STORM-682.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.9.4
                   0.10.0

> Supervisor local worker state corrupted and failing to start.
> -------------------------------------------------------------
>
>                 Key: STORM-682
>                 URL: https://issues.apache.org/jira/browse/STORM-682
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Parth Brahmbhatt
>            Assignee: Parth Brahmbhatt
>             Fix For: 0.10.0, 0.9.4
>
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files, 
> the supervisor's local state gets corrupted. The only way to recover the 
> supervisor from this state is to delete the local state folder where the 
> supervisor stores all worker information. This workaround becomes very 
> cumbersome if the problem occurs on multiple worker nodes.
> The root cause is a mismatch between the order in which the worker heartbeat 
> versioned store files are created and the order in which they are deleted. 
> LocalState.put first creates a data file X and then marks success by creating 
> a file X.version. A get first lists all *.version files, finds the largest 
> value of X, and then issues a read against X. See the pseudocode below:
> {code:java}
> start_supervisor() {
>     workerIds = `ls local-state/workers`
>     for each workerId in workerIds {
>         versions = `ls local-state/workers/workerId/heartbeats/*.version`
>         latest_version = max(versions)
>         // Note: the data file has no .version extension
>         read local-state/workers/workerId/heartbeats/latest_version
>     }
> }
> {code}
> During cleanup, the supervisor first tries to delete file X and then 
> X.version. If X is deleted but the deletion of X.version fails, the supervisor 
> fails to start with a FileNotFoundException in the code above. 
> We propose changing the deletion order so that the .version file is deleted 
> before the data file, and catching any IOException when reading worker 
> heartbeats so that a failed read does not bring down the supervisor.
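
The ordering described above can be illustrated with a minimal, standalone sketch. This is not the actual Storm patch and not Storm's LocalState or versioned store code: the class name HeartbeatStoreSketch and the methods put, delete, and readLatest are hypothetical, and the sketch only assumes the file layout described in the report (a data file X plus a marker X.version).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical illustration only; names and structure are assumptions, not Storm's API.
class HeartbeatStoreSketch {
    static final String MARKER_SUFFIX = ".version";

    /** Write order from the report: data file X first, then the X.version success marker. */
    static void put(Path dir, long version, byte[] data) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(Long.toString(version)), data);
        Files.write(dir.resolve(version + MARKER_SUFFIX), new byte[0]);
    }

    /** Proposed delete order: marker first, so a failure partway through never leaves a marker without data. */
    static void delete(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + MARKER_SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    /** Reads the data file for the largest *.version marker; returns empty instead of throwing. */
    static Optional<byte[]> readLatest(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            Optional<Long> latest = files
                .map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(MARKER_SUFFIX))
                .map(name -> Long.parseLong(name.substring(0, name.length() - MARKER_SUFFIX.length())))
                .max(Long::compare);
            if (!latest.isPresent()) {
                return Optional.empty();
            }
            return Optional.of(Files.readAllBytes(dir.resolve(Long.toString(latest.get()))));
        } catch (IOException | UncheckedIOException | NumberFormatException e) {
            // A missing or unreadable heartbeat is treated as "no heartbeat", not as a fatal error.
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heartbeats");
        put(dir, 1L, "hb-1".getBytes(StandardCharsets.UTF_8));
        put(dir, 2L, "hb-2".getBytes(StandardCharsets.UTF_8));
        // Simulate the reported state: data file gone, marker left behind.
        Files.delete(dir.resolve("2"));
        System.out.println(readLatest(dir).isPresent()); // false, and no exception escapes
        // Marker-first cleanup removes the stale marker, so version 1 becomes readable again.
        delete(dir, 2L);
        System.out.println(new String(readLatest(dir).get(), StandardCharsets.UTF_8)); // hb-1
    }
}
```

With marker-first deletion, a cleanup that fails halfway can at worst leave an orphaned data file, which the read path ignores because it only follows *.version markers. The tolerant read covers state already left behind by the old deletion order.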



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
