Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change 
notification.

The "BSPMaster" page has been changed by ChiaHungLin:
https://wiki.apache.org/hama/BSPMaster?action=diff&rev1=17&rev2=18

  ----
   * [[Registrator|Registrator]]
   * Receptionist 
-  * JobOperator
   * Scheduler
   * ResourceConsultant
   * GroomManager
+  * Monitor
-  * Supervisor(?)
+   * Supervisor(?)
   
- 
  
  == State ==
  The BSPMaster node has two states:
@@ -40, +39 @@

   * STOPPED
  {{attachment:bspmaster_state3.png|BSPMaster State}}
  
+ == Scenario ==
+ 
+  * Restart
+   * When a task fails on a groom server, restart the job by re-running 
'''all''' tasks from the latest universally available checkpoint. Re-running 
only the failed task is not enough, because the universally available 
checkpoint may be more than one superstep behind the current superstep. This 
can lead to a deadlock between the surviving tasks and the restarted task 
during the sync phase. For example, suppose the latest universally available 
checkpoint is at the 6th superstep and the computation is currently moving 
from the 7th to the 8th superstep. If one task fails, the system migrates it 
to another machine and resumes it from the 6th superstep checkpoint, while the 
other tasks keep running until they reach the barrier sync at the 8th 
superstep. The deadlock arises when the resumed task, which previously failed, 
reaches the barrier sync at the 7th superstep, because no other task is at the 
7th superstep. Restarting all tasks is one proposed solution to this task 
failure issue. 
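
The restart scenario above can be illustrated with a small sketch. It is not 
taken from BSPMaster.java: the class RestartSketch and all of its methods are 
hypothetical names, and the barrier is modelled simply as "every task must be 
waiting at the same superstep". Under those assumptions it shows why resuming 
only the failed task leaves the barrier unable to release, while restarting 
every task does not.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names come from the Hama code base.
final class RestartSketch {

    // Assumption: a barrier sync releases only when every task is waiting
    // at the barrier of the same superstep.
    static boolean barrierCanRelease(int[] waitingAtSuperstep) {
        for (int superstep : waitingAtSuperstep) {
            if (superstep != waitingAtSuperstep[0]) {
                return false;
            }
        }
        return true;
    }

    // Resume only the failed task from the checkpoint. It will next wait at
    // the barrier of superstep (checkpoint + 1); the others stay where they are.
    static int[] restartFailedOnly(int[] waitingAt, int failedTask, int checkpoint) {
        int[] next = Arrays.copyOf(waitingAt, waitingAt.length);
        next[failedTask] = checkpoint + 1;
        return next;
    }

    // Resume every task from the checkpoint, so all of them next wait at the
    // barrier of superstep (checkpoint + 1).
    static int[] restartAll(int[] waitingAt, int checkpoint) {
        int[] next = new int[waitingAt.length];
        Arrays.fill(next, checkpoint + 1);
        return next;
    }

    public static void main(String[] args) {
        int[] waitingAt = {8, 8, 8}; // surviving tasks head for the 8th barrier
        int checkpoint = 6;          // latest universally available checkpoint

        int[] partial = restartFailedOnly(waitingAt, 1, checkpoint);
        System.out.println(Arrays.toString(partial) + " releases: "
                + barrierCanRelease(partial));

        int[] full = restartAll(waitingAt, checkpoint);
        System.out.println(Arrays.toString(full) + " releases: "
                + barrierCanRelease(full));
    }
}
```

With the values from the example (checkpoint at the 6th superstep, other tasks 
heading for the 8th barrier), the partial restart leaves one task at the 7th 
barrier and the rest at the 8th, so the barrier condition never holds.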
+ 
  == Source ==
  
[[http://svn.apache.org/repos/asf/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPMaster.java|BSPMaster.java]]
  
