[
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
P. Taylor Goetz resolved STORM-682.
-----------------------------------
Resolution: Fixed
Fix Version/s: 0.9.4, 0.10.0
> Supervisor local worker state corrupted and failing to start.
> -------------------------------------------------------------
>
> Key: STORM-682
> URL: https://issues.apache.org/jira/browse/STORM-682
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Parth Brahmbhatt
> Assignee: Parth Brahmbhatt
> Fix For: 0.10.0, 0.9.4
>
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files,
> the local state of the supervisor gets corrupted. The only way to recover the
> supervisor from this state is to delete the local state folder where the
> supervisor stores all worker information. This workaround becomes very
> cumbersome if the problem occurs on multiple worker nodes.
> The root cause of the issue is the order in which worker heartbeat versioned
> store files are created versus the order in which they are deleted.
> LocalState.put first creates a data file X and then marks success by creating
> a file X.version. A get first lists all *.version files, finds the largest
> value of X, and then issues a read against X. See the pseudocode below:
> {code:java}
> start_supervisor() {
>   workerIds = `ls local-state/workers`
>   for each workerId in workerIds
>     versions = `ls local-state/workers/workerId/heartbeats/*.version`
>     latest_version = max(versions)
>     // Note: the file that is read has no .version extension
>     read local-state/workers/workerId/heartbeats/latest_version
> }
> {code}
> During cleanup the supervisor first tries to delete file X and then
> X.version. If X is deleted but X.version fails to delete, the supervisor fails
> to start with a FileNotFoundException in the code above.
> We propose to change the deletion order so that the .version files are deleted
> before the data file, and to catch any IOException when reading worker
> heartbeats to avoid supervisor failure.
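
The proposal quoted above can be illustrated with a minimal Java sketch. This is not Storm's actual LocalState or VersionedStore code: the class HeartbeatStoreSketch and its methods put, delete, and readLatest are hypothetical names invented for this illustration, and the file layout (a data file X next to a marker X.version) is assumed from the description. It shows the marker being deleted before the data file, and a read that treats any IOException as "no heartbeat" instead of propagating it.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical illustration only; not the Storm implementation.
final class HeartbeatStoreSketch {

    private static final String VERSION_SUFFIX = ".version";

    // Write order from the description: data file X first, then the marker X.version.
    static void put(Path dir, long version, byte[] data) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(Long.toString(version)), data);
        Files.write(dir.resolve(version + VERSION_SUFFIX), new byte[0]);
    }

    // Proposed deletion order: remove the marker first, then the data file.
    // If the second delete fails, only an unreferenced data file is left behind.
    static void delete(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + VERSION_SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    // Read path: pick the largest X among the X.version markers and read X.
    // Any IOException (missing directory, missing data file) yields an empty
    // result rather than an exception that would abort startup.
    static Optional<byte[]> readLatest(Path dir) {
        long latest = -1;
        try (DirectoryStream<Path> markers =
                 Files.newDirectoryStream(dir, "*" + VERSION_SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                String stem = name.substring(0, name.length() - VERSION_SUFFIX.length());
                try {
                    latest = Math.max(latest, Long.parseLong(stem));
                } catch (NumberFormatException ignored) {
                    // Skip markers whose name is not a numeric version.
                }
            }
            if (latest < 0) {
                return Optional.empty();
            }
            return Optional.of(Files.readAllBytes(dir.resolve(Long.toString(latest))));
        } catch (IOException e) {
            return Optional.empty();
        }
    }
}
```

In this sketch a leftover marker with no data file makes readLatest return an empty Optional, so a caller could skip that worker's heartbeat and continue. Whether the real fix skips, logs, or falls back to an older version is not stated in the description above.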
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)