Hey, I had a bit of time to go through the jira issues and sort out several things related to Fault Tolerance.
Here are my results:

Fault Tolerance in Hama (all related jiras):
[HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
[HAMA-498] BSPTask should periodically ping its parent.

I have then split this into two main parts, "Detect Failure" and "Solve Failure":

Detect Failure:
[HAMA-370] Failure detector for Hama < Nearly complete?
[HAMA-498] BSPTask should periodically ping its parent.

Solve Failure:
[HAMA-445] Make configurable checkpointing
> TODO:
> Groom needs functionality to restart a task
> BSPMaster needs functionality to restart a groom

There is also MISC, which is not strongly related.

MISC:
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
> TODO mainly discussion:
> New BSP "interface", with a chaining of supersteps to make restarting tasks simpler (contained in 440)

Let's make an umbrella jira for this larger task and close 199, since it is way too generic and too old.
We should also split 440, because it combines too many unrelated things.

Also, "Lin" has the majority of them assigned. What is your progress? And do you mind splitting these?

[LINKS]
https://issues.apache.org/jira/browse/HAMA-440
https://issues.apache.org/jira/browse/HAMA-119
https://issues.apache.org/jira/browse/HAMA-445
https://issues.apache.org/jira/browse/HAMA-440
https://issues.apache.org/jira/browse/HAMA-370
https://issues.apache.org/jira/browse/HAMA-498

--
Thomas Jungblut
Berlin <[email protected]>
