[ 
https://issues.apache.org/jira/browse/GIRAPH-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433048#comment-13433048
 ] 

Maja Kabiljo commented on GIRAPH-298:
-------------------------------------

I've been trying (unsuccessfully) to figure out how automatic restart from a 
checkpoint works. Please correct me where I'm wrong; this is how I understand 
it after experimenting with the example and reading the code:
Each worker registers its health at the beginning of a superstep. The master 
enters BspServiceMaster.barrierOnWorkerList, which returns false only if some 
worker didn't register its health, i.e. it crashed before starting the 
superstep computation. This is the only case in which we reach 
SuperstepState.WORKER_FAILURE. If a worker crashes during the superstep 
computation, the master stays in the loop in barrierOnWorkerList and 
eventually crashes because of ZooKeeper. All the other workers then crash as 
well. Hadoop restarts them, but I don't see where we set which superstep to 
restart from after that.
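
To make the sequence above concrete, here is a minimal sketch of the control 
flow as described. It is not Giraph code: the class BarrierSketch, the Outcome 
enum, and the methods healthBarrier and runSuperstep are hypothetical names 
invented for illustration. MASTER_STUCK only stands in for "master keeps 
looping in the barrier until the ZooKeeper-related crash"; the sketch does not 
model ZooKeeper or Hadoop restarts.

```java
import java.util.List;
import java.util.Set;

/** Toy model of the behaviour described in this comment; NOT Giraph code. */
public class BarrierSketch {
    /** Hypothetical outcomes of one superstep in this model. */
    enum Outcome { SUPERSTEP_DONE, WORKER_FAILURE, MASTER_STUCK }

    /**
     * Stand-in for the health check at the start of a superstep:
     * returns false only if some expected worker never registered its health.
     */
    static boolean healthBarrier(List<String> expected, Set<String> registered) {
        return registered.containsAll(expected);
    }

    /**
     * One superstep as seen by the master (assumed flow, per the comment above).
     * `registered` = workers that registered health, `finished` = workers
     * that completed the superstep computation.
     */
    static Outcome runSuperstep(List<String> expected,
                                Set<String> registered,
                                Set<String> finished) {
        if (!healthBarrier(expected, registered)) {
            // Worker died before computation started: the only detected failure.
            return Outcome.WORKER_FAILURE;
        }
        if (!finished.containsAll(expected)) {
            // Worker died mid-superstep: master keeps waiting, no failure state is set.
            return Outcome.MASTER_STUCK;
        }
        return Outcome.SUPERSTEP_DONE;
    }

    public static void main(String[] args) {
        List<String> workers = List.of("w1", "w2");
        // w2 crashed before registering its health
        System.out.println(runSuperstep(workers, Set.of("w1"), Set.of()));
        // w2 registered, then crashed during computation
        System.out.println(runSuperstep(workers, Set.of("w1", "w2"), Set.of("w1")));
        // both workers healthy and finished
        System.out.println(runSuperstep(workers, Set.of("w1", "w2"), Set.of("w1", "w2")));
    }
}
```

Only the first call ends in WORKER_FAILURE; the second is the case 
TestAutoCheckpoint appears to hit, where no failure state is ever recorded.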
                
> TestAutoCheckpoint doesn't restart from checkpoint
> --------------------------------------------------
>
>                 Key: GIRAPH-298
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-298
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Maja Kabiljo
>
> When we run TestAutoCheckpoint, after one worker fails, the master and all 
> other workers also fail. All of them get restarted, but they restart from the 
> beginning, not from the last checkpointed superstep.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
