[
https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433048#comment-13433048
]
Maja Kabiljo commented on GIRAPH-298:
-------------------------------------
I've been trying (unsuccessfully) to figure out how automatic restarting from
a checkpoint works. Please correct me where I am wrong; this is how I see it
after investigating with the example and looking at the code:
A worker registers its health at the beginning of each superstep. The master
enters BspServiceMaster.barrierOnWorkerList, from which it exits with false only
if some worker didn't register its health, i.e. crashed before starting the
superstep computation. This is the only case in which we reach
SuperstepState.WORKER_FAILURE. If a worker crashes during the superstep
computation, the master stays in the loop in barrierOnWorkerList and
eventually crashes because of ZooKeeper. All the others then crash as well.
Hadoop restarts them, but I don't see a place where we set which superstep we
should restart from after that.
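
The flow described above can be sketched as a toy Java model. This is not Giraph code: the class CheckpointRestartSketch, the Outcome enum and the superstepOutcome method are all hypothetical names made up for illustration, and the real barrier polls ZooKeeper in a loop rather than comparing sets. It only encodes the three cases as understood in this comment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the behaviour described in the comment; NOT Giraph's code. */
final class CheckpointRestartSketch {

    /** Hypothetical outcomes; only WORKER_FAILURE is named in the comment. */
    enum Outcome { WORKER_FAILURE, SUPERSTEP_DONE, MASTER_STUCK }

    /**
     * @param expected workers the master expects in this superstep
     * @param healthy  workers that registered health at superstep start
     * @param finished workers that completed the superstep computation
     */
    static Outcome superstepOutcome(Set<String> expected,
                                    Set<String> healthy,
                                    Set<String> finished) {
        // Case 1: a worker died before registering its health. Per the
        // comment, this is the only path on which the barrier returns false.
        if (!healthy.containsAll(expected)) {
            return Outcome.WORKER_FAILURE;
        }
        // Case 2: every worker finished, so the barrier is released.
        if (finished.containsAll(expected)) {
            return Outcome.SUPERSTEP_DONE;
        }
        // Case 3: a worker registered and then died mid-computation. The
        // master keeps waiting; this stands in for the eventual crash.
        return Outcome.MASTER_STUCK;
    }

    private static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Set<String> all = setOf("w1", "w2");
        Set<String> onlyW1 = setOf("w1");
        System.out.println(superstepOutcome(all, onlyW1, onlyW1)); // WORKER_FAILURE
        System.out.println(superstepOutcome(all, all, onlyW1));    // MASTER_STUCK
        System.out.println(superstepOutcome(all, all, all));       // SUPERSTEP_DONE
    }
}
```

In this model the second call is the situation from the issue: the failure goes undetected at the barrier, so nothing selects a checkpointed superstep to restart from.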
> TestAutoCheckpoint doesn't restart from checkpoint
> --------------------------------------------------
>
> Key: GIRAPH-298
> URL: https://issues.apache.org/jira/browse/GIRAPH-298
> Project: Giraph
> Issue Type: Bug
> Reporter: Maja Kabiljo
>
> When we run TestAutoCheckpoint, after one worker fails, the master and all
> other workers also fail. All of them get restarted, but they restart from the
> beginning, not from the last checkpointed superstep.