Hi guys, I have a question about how Giraph restarts from the last checkpoint after a worker failure.
I ran an example with 5 workers and 1 master. Two workers were preempted during the run, but I found that the other 3 workers also quit. I checked the code and found the following in BspServiceWorker.processEvent(WatchedEvent event):

    if ((ApplicationState.valueOf(jsonObj.getString(JSONOBJ_STATE_KEY)) ==
        ApplicationState.START_SUPERSTEP) &&
        jsonObj.getLong(JSONOBJ_APPLICATION_ATTEMPT_KEY) !=
        getApplicationAttempt()) {
      LOG.fatal("processEvent: Worker will restart " +
          "from command - " + jsonObj.toString());
      System.exit(-1);
    }

Does this mean that all "good" workers also need to quit, and that the job needs to request resources again?

BTW, I am using pure YARN with Giraph-1.1.0-SNAPSHOT.

The following is the log from one "good" worker:

    2014-04-29 21:56:55,284 INFO [main-EventThread] worker.BspServiceWorker (BspServiceWorker.java:processEvent(1604)) - processEvent: Job state changed, checking to see if it needs to restart
    2014-04-29 21:56:55,285 INFO [main-EventThread] bsp.BspService (BspService.java:getJobState(695)) - getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1398826558049_0001/_masterJobState)
    2014-04-29 21:56:55,287 FATAL [main-EventThread] worker.BspServiceWorker (BspServiceWorker.java:processEvent(1619)) - processEvent: Worker will restart from command - {"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,"_superstepKey":24}

Thanks for help!
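
P.S. To make my reading of the check quoted above concrete, here is a minimal standalone sketch of the same condition. It is not Giraph code: the class name, the method name mustRestart, and the stand-in enum are hypothetical, and the surviving worker's local attempt number of 0 is my assumption, since the log only shows the attempt number written to the job state (1).

```java
// Standalone sketch of the restart check quoted above; NOT Giraph code.
// RestartCheckSketch and mustRestart are hypothetical names, and the enum
// is only a stand-in for Giraph's ApplicationState.
class RestartCheckSketch {
  enum ApplicationState { START_SUPERSTEP, FAILED, FINISHED }

  // Mirrors the quoted condition: the state is START_SUPERSTEP and the
  // attempt number in the job state differs from the worker's own.
  static boolean mustRestart(String state, long attemptInJobState,
      long localAttempt) {
    return ApplicationState.valueOf(state) == ApplicationState.START_SUPERSTEP
        && attemptInJobState != localAttempt;
  }

  public static void main(String[] args) {
    // Assumption: the surviving worker is still on attempt 0.
    long assumedLocalAttempt = 0L;

    // Values from the logged job state:
    // {"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,...}
    System.out.println(
        mustRestart("START_SUPERSTEP", 1L, assumedLocalAttempt)); // true

    // Same state with a matching attempt number: the worker keeps running.
    System.out.println(
        mustRestart("START_SUPERSTEP", 0L, assumedLocalAttempt)); // false
  }
}
```

If this reading is right, any worker whose own attempt number differs from the one in the job state takes the System.exit(-1) branch, whether or not that worker itself failed.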