[ 
https://issues.apache.org/jira/browse/HAMA-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284628#comment-13284628
 ] 

Thomas Jungblut commented on HAMA-505:
--------------------------------------

Suraj and I had a short meeting on fault tolerance (FT) in 0.6.0; here is the
result of our first iteration:

First focus on 0.6.0

# Checkpointing on the receiving side (HAMA-557)
## ZK stores the checkpoint files / paths of successful supersteps
# When a fault happens:
## Single task recovery (when the fault happens inside the computation)
### The Groom detects the failure, flags the task as failed and forwards a
request to schedule a new task to the scheduler (HAMA-534). BSPTask#run takes
care of correctly filling the message queue in BSPPeerImpl and MessageManager.
## Global recovery (when the fault happens during sync or checkpointing)
### All tasks must fail and be rescheduled from the last successful superstep
# Restart the task(s) with the Superstep API (HAMA-533)
## Improve the Superstep API with HAMA-546
## Improve the Superstep API, or rather the BSP API, with the following features:
### deregister/close (empty the BSP slot)
### relieve from sync: the task keeps running but no longer syncs
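
To make the two recovery modes above concrete, here is a minimal Java sketch of
how the choice between them could be made. All names (RecoverySketch, Phase,
Mode, modeFor, tasksToRestart) are hypothetical and are not part of Hama; the
sketch only assumes that the phase in which the fault occurred is known and
that tasks are numbered 0..numTasks-1.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: none of these types exist in Hama.
final class RecoverySketch {
    // Phase the job was in when the fault was detected.
    enum Phase { COMPUTATION, SYNC, CHECKPOINTING }

    // The two recovery modes from the list above.
    enum Mode { SINGLE_TASK, GLOBAL }

    // A fault inside the computation only needs the failed task restarted;
    // a fault during sync or checkpointing requires restarting all tasks.
    static Mode modeFor(Phase phaseAtFault) {
        return phaseAtFault == Phase.COMPUTATION ? Mode.SINGLE_TASK : Mode.GLOBAL;
    }

    // Returns the ids of the tasks that would have to be rescheduled.
    static List<Integer> tasksToRestart(Phase phaseAtFault, int failedTask, int numTasks) {
        List<Integer> restart = new ArrayList<>();
        if (modeFor(phaseAtFault) == Mode.SINGLE_TASK) {
            restart.add(failedTask);
        } else {
            for (int id = 0; id < numTasks; id++) {
                restart.add(id);
            }
        }
        return restart;
    }
}
```

In both modes the restarted tasks would resume from the last successfully
checkpointed superstep; looking that superstep up (for example in ZK) is left
out of the sketch.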
                
> Fault Tolerant Job Processing
> -----------------------------
>
>                 Key: HAMA-505
>                 URL: https://issues.apache.org/jira/browse/HAMA-505
>             Project: Hama
>          Issue Type: Umbrella
>            Reporter: Thomas Jungblut
>
> This umbrella summarizes all issues related to checkpointing and task 
> restarting to achieve fault tolerance at the job level.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
