+1. It would be good to have an umbrella JIRA so we can track this more easily. Failure detection (HAMA-370) is already done and was tested on my machines.
The first point in HAMA-440 is not needed because it has been integrated into the BSP task.

On 3 February 2012 09:38, Edward J. Yoon <[email protected]> wrote:
> We also can separate the issue into two parts: 1) cluster high
> availability and 2) fault tolerant job processing. Only HAMA-370 is
> related with 1).
>
> On Fri, Feb 3, 2012 at 10:23 AM, Edward J. Yoon <[email protected]> wrote:
>> +1
>>
>> On Thu, Feb 2, 2012 at 8:39 PM, Thomas Jungblut
>> <[email protected]> wrote:
>>> Hey,
>>>
>>> I had a bit of time to go through the jira issues and sort out several
>>> things related to Fault Tolerance.
>>>
>>> Here are my results:
>>>
>>> Fault Tolerance in Hama (all jiras related):
>>>
>>> [HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
>>> [HAMA-445] Make configurable checkpointing
>>> [HAMA-440] Features required in recovery procedure.
>>> [HAMA-498] BSPTask should periodically ping its parent.
>>>
>>> Then I have splitted this in two main parts, "Detect Failure" and "Solve
>>> Failure":
>>>
>>> Detect Failure:
>>> [HAMA-370] Failure detector for Hama < Nearly complete?
>>> [HAMA-498] BSPTask should periodically ping its parent.
>>>
>>> Solve Failure:
>>> [HAMA-445] Make configurable checkpointing
>>>> TODO:
>>>> Groom needs functionality to restart a task
>>>> BSPMaster needs functionality to restart a groom
>>>
>>> Also here is MISC, which is not strongly related.
>>>
>>> MISC:
>>> [HAMA-445] Make configurable checkpointing
>>> [HAMA-440] Features required in recovery procedure.
>>>> TODO mainly discussion:
>>>> New BSP "interface", with a chaining of supersteps to make restarting
>>> tasks more simpler (contained in 440)
>>>
>>>
>>> Let's make an umbrella jira for this larger task and close 199, since this
>>> is way too generic and too old.
>>> We should also split 440, because it combines too much unrelated things
>>> together.
>>>
>>> Also "Lin" has assigned the majority of them. What is your progress? And do
>>> you mind splitting these?
>>>
>>> [LINKS]
>>> https://issues.apache.org/jira/browse/HAMA-440
>>> https://issues.apache.org/jira/browse/HAMA-119
>>> https://issues.apache.org/jira/browse/HAMA-445
>>> https://issues.apache.org/jira/browse/HAMA-440
>>> https://issues.apache.org/jira/browse/HAMA-370
>>> https://issues.apache.org/jira/browse/HAMA-498
>>>
>>> --
>>> Thomas Jungblut
>>> Berlin <[email protected]>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
