[
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329449#comment-14329449
]
ASF GitHub Bot commented on STORM-682:
--------------------------------------
GitHub user Parth-Brahmbhatt opened a pull request:
https://github.com/apache/storm/pull/437
STORM-682: supervisor should handle worker state corruption gracefully.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Parth-Brahmbhatt/incubator-storm STORM-682
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/437.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #437
----
commit afd8f81ba2650423184be3fcf6e00dd7c558acbe
Author: Parth Brahmbhatt <[email protected]>
Date: 2015-02-20T19:56:22Z
STORM-682: supervisor should handle worker state corruption gracefully.
----
> Supervisor local worker state corrupted and failing to start.
> -------------------------------------------------------------
>
> Key: STORM-682
> URL: https://issues.apache.org/jira/browse/STORM-682
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Parth Brahmbhatt
> Assignee: Parth Brahmbhatt
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files, the
> supervisor's local state becomes corrupted. The only way to recover the
> supervisor from this state is to delete the local state folder where the
> supervisor stores all worker information. This fix can get very cumbersome if
> it happens on multiple worker nodes.
> The root cause of the issue is the order in which the worker heartbeat versioned
> store files are created versus the order in which they are deleted. LocalState.put
> first creates a data file X and then marks success by creating a file
> X.version. A get first lists all *.version files, finds the largest
> value of X, and then issues a read against X. See the pseudocode
> below:
> {code:java}
> start_supervisor() {
>   workerIds = `ls local-state/workers`
>   for each workerId in workerIds {
>     versions = `ls local-state/workers/workerId/heartbeats/*.version`
>     latest_version = max(versions)
>     // Note: the file that is read has no .version extension
>     read local-state/workers/workerId/heartbeats/latest_version
>   }
> }
> {code}
> During cleanup, the supervisor first tries to delete file X and then X.version. If X is
> deleted but the deletion of X.version fails, the supervisor fails to start with a
> FileNotFoundException in the code above.
> We propose to change the deletion order so that the .version files are deleted
> before the data files, and to catch any IOException when reading worker heartbeats
> so that the supervisor does not fail.
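
The two proposed changes can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request and does not use Storm's LocalState API; the class name HeartbeatStoreSketch and its methods are hypothetical, and it assumes only the on-disk layout described above (a data file named X next to a marker named X.version). Deletion removes the marker first, and the read path treats any IOException as "no heartbeat available" rather than letting it propagate.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical illustration only; not Storm's LocalState implementation.
final class HeartbeatStoreSketch {
    static final String VERSION_SUFFIX = ".version";

    // Delete the success marker before the data file. If the second delete
    // fails, an orphaned data file remains, which readers never look for.
    static void deleteVersion(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + VERSION_SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    // Return the largest X for which a file X.version exists, if any.
    static Optional<Long> latestVersion(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> markers =
                 Files.newDirectoryStream(dir, "*" + VERSION_SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                String stem = name.substring(0, name.length() - VERSION_SUFFIX.length());
                try {
                    latest = Math.max(latest, Long.parseLong(stem));
                } catch (NumberFormatException ignored) {
                    // Skip marker files whose name is not a version number.
                }
            }
        }
        return latest < 0 ? Optional.empty() : Optional.of(latest);
    }

    // Read the latest heartbeat. A missing or unreadable data file yields
    // an empty result instead of an exception that would abort startup.
    static Optional<byte[]> readLatestHeartbeat(Path dir) {
        try {
            Optional<Long> latest = latestVersion(dir);
            if (!latest.isPresent()) {
                return Optional.empty();
            }
            return Optional.of(Files.readAllBytes(dir.resolve(Long.toString(latest.get()))));
        } catch (IOException e) {
            // Covers the corrupted case: X.version exists but X does not.
            return Optional.empty();
        }
    }
}
```

With this ordering, a partially failed cleanup leaves at worst a data file without a marker, which the read path ignores. The guarded read additionally covers state that was already corrupted by the old deletion order.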