[
https://issues.apache.org/jira/browse/HAMA-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284628#comment-13284628
]
Thomas Jungblut commented on HAMA-505:
--------------------------------------
Suraj and I had a small meeting on FT in 0.6.0; here are the results of our
first iteration:
First focus on 0.6.0
# Checkpointing on receive side HAMA-557
## ZK stores successful superstep checkpointing files / paths
# When a fault happens:
## Single task recovery (when the fault happens during computation)
### The Groom detects the failure, flags the task as failed and sends a
request to schedule a new task to the scheduler (HAMA-534). BSPTask#run takes
care of correctly refilling the message queue in BSPPeerImpl and MessageManager.
## Global recovery (when the fault happens during sync or checkpointing)
### All tasks must fail and be rescheduled from the last successful superstep
# Restart the task(s) with Superstep API HAMA-533
## Improve Superstep API with HAMA-546
## Improve the Superstep API, or rather the BSP API, with the following features:
### deregister/close (empty the BSP slot)
### relieve from sync: the task keeps running but no longer takes part in sync
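
A minimal sketch of the recovery decision outlined above, in plain Java. Every
name here (RecoverySketch, Phase, Action, CheckpointRegistry, the example
paths) is hypothetical and not part of Hama; the registry is an in-memory
stand-in for the superstep-to-path record that the plan keeps in ZK.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; none of these are Hama classes.
final class RecoverySketch {

    /** Phase of the superstep in which the fault was observed. */
    enum Phase { COMPUTATION, SYNC, CHECKPOINTING }

    /** Recovery mode the plan prescribes for a given phase. */
    enum Action { RESTART_SINGLE_TASK, RESTART_ALL_TASKS }

    /** In-memory stand-in for the ZK record of successful checkpoints. */
    static final class CheckpointRegistry {
        private final Map<Long, String> pathBySuperstep = new ConcurrentHashMap<>();

        /** Record that the checkpoint for this superstep was written completely. */
        void recordSuccess(long superstep, String path) {
            pathBySuperstep.put(superstep, path);
        }

        /** Highest superstep with a recorded checkpoint, or empty if there is none. */
        OptionalLong lastSuccessfulSuperstep() {
            return pathBySuperstep.keySet().stream()
                    .mapToLong(Long::longValue)
                    .max();
        }

        /** Checkpoint path for the superstep, or null if none was recorded. */
        String pathFor(long superstep) {
            return pathBySuperstep.get(superstep);
        }
    }

    /** Faults during computation restart one task; all other faults restart every task. */
    static Action decide(Phase phase) {
        return phase == Phase.COMPUTATION
                ? Action.RESTART_SINGLE_TASK
                : Action.RESTART_ALL_TASKS;
    }
}
```

In both modes the restarted task(s) would resume from
lastSuccessfulSuperstep(); only the number of tasks that get rescheduled
differs.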
> Fault Tolerant Job Processing
> -----------------------------
>
> Key: HAMA-505
> URL: https://issues.apache.org/jira/browse/HAMA-505
> Project: Hama
> Issue Type: Umbrella
> Reporter: Thomas Jungblut
>
> This umbrella summarizes all issues related to checkpointing and task
> restarting to achieve fault tolerance on the job level.