Have a question regarding restart from last checkpoint

Wei Yan Tue, 29 Apr 2014 20:33:14 -0700

Hi, guys,

I have a question about how Giraph restarts from the last checkpoint after a
worker failure.


I ran an example with 5 workers and 1 master. Two workers were preempted
during the run, but I found that the other 3 workers also quit. I checked
the code and found the following in
BspServiceWorker.processEvent(WatchedEvent event):

if ((ApplicationState.valueOf(jsonObj.getString(JSONOBJ_STATE_KEY)) ==
    ApplicationState.START_SUPERSTEP) &&
    jsonObj.getLong(JSONOBJ_APPLICATION_ATTEMPT_KEY) !=
    getApplicationAttempt()) {
        LOG.fatal("processEvent: Worker will restart " +
            "from command - " + jsonObj.toString());
        System.exit(-1);
}

Does this mean that all "good" workers also have to quit, and that the job
has to request resources again? BTW, I am using pure YARN with
Giraph-1.1.0-SNAPSHOT.

The following is the log from one "good" worker:

2014-04-29 21:56:55,284 INFO  [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1604)) - processEvent: Job state
changed, checking to see if it needs to restart
2014-04-29 21:56:55,285 INFO  [main-EventThread] bsp.BspService
(BspService.java:getJobState(695)) - getJobState: Job state already exists
(/_hadoopBsp/giraph_yarn_application_1398826558049_0001/_masterJobState)
2014-04-29 21:56:55,287 FATAL [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1619)) - processEvent: Worker will
restart from command -
{"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,"_superstepKey":24}
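
To make the question concrete, here is a minimal sketch of how I read that
check, using the values from the log above. The class name, the method name,
the OTHER placeholder state, and the guess that the surviving worker is still
on application attempt 0 are my own assumptions, not Giraph code.

```java
// Sketch only: mirrors the shape of the quoted condition, not Giraph's classes.
final class RestartCheckSketch {
    // START_SUPERSTEP comes from the log; OTHER is a hypothetical placeholder
    // standing in for any other job state.
    enum ApplicationState { START_SUPERSTEP, OTHER }

    /**
     * Returns true when a worker following the quoted condition would reach
     * System.exit(-1): the master commands START_SUPERSTEP and the attempt
     * number in the command differs from the worker's own attempt number.
     */
    static boolean workerWouldExit(ApplicationState state,
                                   long commandedAttempt,
                                   long localAttempt) {
        return state == ApplicationState.START_SUPERSTEP
            && commandedAttempt != localAttempt;
    }

    public static void main(String[] args) {
        // From the log: "_stateKey":"START_SUPERSTEP", "_applicationAttemptKey":1.
        // ASSUMPTION: the surviving worker is still on attempt 0.
        System.out.println(
            workerWouldExit(ApplicationState.START_SUPERSTEP, 1L, 0L));
        // Same command seen by a worker already on attempt 1 (hypothetical).
        System.out.println(
            workerWouldExit(ApplicationState.START_SUPERSTEP, 1L, 1L));
    }
}
```

If that reading is right, any worker whose attempt number differs from the
one in the command exits, whether or not that worker itself failed.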

Thanks for your help!
