Parth Brahmbhatt created STORM-682:
--------------------------------------

             Summary: Supervisor local worker state corrupted and failing to start.
                 Key: STORM-682
                 URL: https://issues.apache.org/jira/browse/STORM-682
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Parth Brahmbhatt
            Assignee: Parth Brahmbhatt


If the supervisor's cleanup of a worker fails to delete some heartbeat files, 
the supervisor's local state becomes corrupted. The only way to recover the 
supervisor from this state is to delete the local state folder where the 
supervisor stores all worker information. This fix becomes very cumbersome if 
it happens on multiple worker nodes.

The root cause of the issue is the order in which the worker heartbeat 
versioned store files are created versus the order in which they are deleted. 
LocalState.put first creates a data file X and then marks success by creating 
a file X.version. A get first lists all *.version files, finds the largest 
value of X, and then issues a read against X. See the pseudocode below:

{code:java}
start_supervisor() {
    workerIds = `ls local-state/workers`
    for each workerId in workerIds {
        versions = `ls local-state/workers/workerId/heartbeats/*.version`
        latest_version = max(versions)
        // Note: the data file name has no .version extension
        read local-state/workers/workerId/heartbeats/latest_version
    }
}
{code}

Cleanup first tries to delete file X and then X.version. If X is deleted but 
the deletion of X.version fails, the supervisor fails to start with a 
FileNotFoundException in the code above.
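
The failure can be reproduced outside Storm with a minimal sketch. This is not 
the actual LocalState code: the class OrphanVersionDemo and its methods are 
hypothetical, and it assumes only the file layout described above (a data file 
X plus a marker file X.version in one directory). It needs Java 9 or later for 
InputStream.readAllBytes.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the versioned heartbeat store; not Storm's API.
public class OrphanVersionDemo {
    private static final String SUFFIX = ".version";

    // Write the data file X first, then mark success by creating X.version.
    static void put(Path dir, long version, byte[] data) throws IOException {
        Files.write(dir.resolve(Long.toString(version)), data);
        Files.createFile(dir.resolve(version + SUFFIX));
    }

    // Return the largest X among the *.version markers, or -1 if none exist.
    static long latestVersion(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                String number = name.substring(0, name.length() - SUFFIX.length());
                latest = Math.max(latest, Long.parseLong(number));
            }
        }
        return latest;
    }

    // Read the data file for the latest marker; a missing data file
    // surfaces as FileNotFoundException from FileInputStream.
    static byte[] readLatest(Path dir) throws IOException {
        Path data = dir.resolve(Long.toString(latestVersion(dir)));
        try (FileInputStream in = new FileInputStream(data.toFile())) {
            return in.readAllBytes();
        }
    }

    // Simulate a cleanup that removed X but left X.version behind.
    // Returns true if the subsequent read fails with FileNotFoundException.
    static boolean reproduces() throws IOException {
        Path dir = Files.createTempDirectory("heartbeats");
        put(dir, 1L, new byte[] {42});
        Files.delete(dir.resolve("1")); // data file gone, marker remains
        try {
            readLatest(dir);
            return false;
        } catch (FileNotFoundException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("orphaned marker breaks read: " + reproduces());
    }
}
```

The read trusts the marker as proof that the data file exists, so an orphaned 
marker is enough to make every later read fail.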

We propose to change the deletion order so that the .version file is deleted 
before the data file, and to catch any IOException when reading worker 
heartbeats so that the supervisor does not fail on startup.
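
A minimal sketch of the two proposed changes, under the same assumed file 
layout (data file X plus marker X.version). The class SafeHeartbeatStore and 
its method names are hypothetical and only illustrate the idea; this is not the 
actual patch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical illustration of the proposal; not Storm's LocalState API.
public class SafeHeartbeatStore {
    private static final String SUFFIX = ".version";

    // Delete the marker first, then the data file. If the second delete
    // fails, the leftover data file has no marker, so readers never see it.
    static void deleteVersion(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    // Read the latest heartbeat. Any IOException (for example a marker whose
    // data file is missing) yields an empty result instead of propagating.
    static Optional<byte[]> readLatest(Path dir) {
        try {
            long latest = -1;
            try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
                for (Path marker : markers) {
                    String name = marker.getFileName().toString();
                    String number = name.substring(0, name.length() - SUFFIX.length());
                    latest = Math.max(latest, Long.parseLong(number));
                }
            }
            if (latest < 0) {
                return Optional.empty(); // no successful writes recorded
            }
            return Optional.of(Files.readAllBytes(dir.resolve(Long.toString(latest))));
        } catch (IOException e) {
            return Optional.empty(); // treat an unreadable heartbeat as absent
        }
    }
}
```

With this ordering a partial cleanup can leave at most an unreferenced data 
file behind, which wastes disk space but cannot break a read. Catching the 
IOException additionally covers state that was already corrupted before the 
ordering change.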



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
