MaoYuan Xian created HAMA-793:
---------------------------------

             Summary: Job fails to recover when more than one task fails at 
the same time, even when fault tolerance is enabled.
                 Key: HAMA-793
                 URL: https://issues.apache.org/jira/browse/HAMA-793
             Project: Hama
          Issue Type: Bug
          Components: bsp core
    Affects Versions: 0.6.2
            Reporter: MaoYuan Xian
            Priority: Minor


I found that fault tolerance does not work when more than one task fails at 
the same time while a job is running.

The cause is in the schedule method of SimpleTaskScheduler: when it finds that 
jobResult equals false, job.kill is called, which then triggers 
JobInProgress.garbageCollection. The job directory is cleaned up, and that 
makes the recovery of the job fail.

I made the following modification in SimpleTaskScheduler, which avoids the 
problem, but I am not sure whether it is a comprehensive solution:
{code}
-      if (Boolean.FALSE.equals(jobResult)) {
+      if ((Boolean.FALSE.equals(jobResult))
+          && (job.getStatus().getRunState() != JobStatus.RECOVERING)) {
{code}
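
To make the intent of the guard easier to see in isolation, here is a minimal standalone sketch. It is not the actual Hama code: the class KillGuardSketch, the RunState enum, and the shouldKill method are hypothetical stand-ins for the check inside SimpleTaskScheduler, and the real JobStatus.RECOVERING may be defined differently.

```java
// Hypothetical, self-contained model of the guard from the diff above.
// None of these names come from Hama; they only mirror the condition.
class KillGuardSketch {

    // Assumed stand-in for the run states exposed by JobStatus.
    enum RunState { RUNNING, RECOVERING, FAILED }

    // Returns true only when the job reported failure (jobResult is FALSE)
    // and the job is not currently in the middle of a recovery.
    static boolean shouldKill(Boolean jobResult, RunState state) {
        return Boolean.FALSE.equals(jobResult)
            && state != RunState.RECOVERING;
    }

    public static void main(String[] args) {
        // A failed result on a running job still leads to a kill.
        System.out.println(shouldKill(Boolean.FALSE, RunState.RUNNING));
        // A failed result while recovering is ignored, so the job
        // directory would not be cleaned up during recovery.
        System.out.println(shouldKill(Boolean.FALSE, RunState.RECOVERING));
    }
}
```

With this condition, a second task failure that arrives while the job is already recovering no longer kills the job. Whether skipping the kill is safe in every case (for example, when the recovery itself cannot succeed) is the open question raised above.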

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
