Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change 
notification.

The "BSPMaster" page has been changed by ChiaHungLin:
https://wiki.apache.org/hama/BSPMaster?action=diff&rev1=20&rev2=21

  == Scenario ==
  
   * Restart
-   * When a '''reported''' task fails on a groom server, restart that job by 
re-running '''all''' tasks from the latest checkpoint that universally 
available. The reason not merely re-running the task that fails comes from the 
fact that universally available checkpoint may not be only one step behind the 
current superstep. This may lead to the deadlock between alive tasks and the 
restarted one during sync phase. For example, the universally checkpoint 
available is the 6th superstep, and currently running the computation from the 
7th to 8th superstep. Suppose one of the tasks fails, then the system migrates 
the failed task to another machine and resumes the failed task from the 6th 
superstep checkpoint whilst other tasks keep continuously running until hitting 
the barrier sync at the superstep 8th. Now the dead lock is raised when the 
resumed task, that previous fails, hits the barrier sync at the superstep 7th 
because no other tasks are at the superstep 7th. There is one proposed solution 
to fix a task failure issue. A more complicated logic can be done for this 
issue, but right now may just impmlement the simpler one. 
+   * When a '''reported''' task fails on a groom server, restart the job by 
re-running '''all''' tasks from the latest checkpoint that is universally 
available. The reason for not re-running only the failed task is that the 
latest universally available checkpoint may be more than one superstep behind 
the current superstep, which can lead to a deadlock between the live tasks and 
the restarted one during the sync phase. For example, suppose the latest 
universally available checkpoint is from the 6th superstep and the computation 
is currently moving from the 7th to the 8th superstep. If one of the tasks 
fails, the system migrates it to another machine and resumes it from the 6th 
superstep checkpoint, while the other tasks keep running until they hit the 
barrier sync at the 8th superstep. A deadlock then arises when the resumed 
task, which previously failed, hits the barrier sync at the 7th superstep, 
because no other task is at the 7th superstep. This is one proposed solution 
to the task failure issue. More complicated logic could be applied to this 
issue, but for now we may just implement the simpler one. 
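
The deadlock in the example above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Hama's actual API: the class `RestartSketch` and its methods are hypothetical names, each task is reduced to the superstep whose barrier it is waiting at, and a barrier is assumed to release only when every task waits at the same superstep.

```java
import java.util.Arrays;

/** Toy model of the two restart policies; hypothetical, not part of Hama. */
class RestartSketch {

    /** Assumed barrier rule: release only if all tasks wait at the same superstep. */
    public static boolean barrierReleases(int[] waitingAt) {
        for (int superstep : waitingAt) {
            if (superstep != waitingAt[0]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Re-run only the failed task from the checkpoint. It then reaches the
     * barrier of the superstep right after the checkpoint, while the other
     * tasks stay where they already were.
     */
    public static int[] restartFailedOnly(int[] waitingAt, int failedTask, int checkpoint) {
        int[] next = Arrays.copyOf(waitingAt, waitingAt.length);
        next[failedTask] = checkpoint + 1;
        return next;
    }

    /** Re-run all tasks from the checkpoint, so they all reach the same barrier. */
    public static int[] restartAll(int[] waitingAt, int checkpoint) {
        int[] next = new int[waitingAt.length];
        Arrays.fill(next, checkpoint + 1);
        return next;
    }

    public static void main(String[] args) {
        int checkpoint = 6;           // latest universally available checkpoint
        int[] waitingAt = {8, 8, 8};  // live tasks wait at the 8th superstep barrier

        int[] failedOnly = restartFailedOnly(waitingAt, 1, checkpoint);
        int[] all = restartAll(waitingAt, checkpoint);

        System.out.println("failed-only releases barrier: " + barrierReleases(failedOnly));
        System.out.println("restart-all releases barrier: " + barrierReleases(all));
    }
}
```

With the numbers from the scenario, re-running only the failed task leaves it waiting at the 7th superstep barrier while the others wait at the 8th, so the assumed barrier rule never releases. Re-running all tasks puts every task at the 7th superstep barrier, which is why the simpler policy avoids the deadlock.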
   
  
  == Source ==
