[
https://issues.apache.org/jira/browse/HAMA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009548#comment-13009548
]
ChiaHung Lin commented on HAMA-370:
-----------------------------------
Indeed, the implementation in the patch also contains a heartbeat mechanism: the
monitored process periodically sends heartbeats.
The difference is that a conventional heartbeat failure detector has a fixed
timeout. The phi accrual failure detector separates its functions into different
components (monitoring, interpretation, etc.) and exposes a suspicion level
rather than a binary trust-or-suspect value, so that different applications,
each equipped with its own interpreter, can use the output value for further
decisions. For instance, a master may allocate urgent tasks to workers that have
a lower suspicion level. Or the monitoring process may interpret the value
according to its business logic when determining whether the monitored process
has crashed.
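
To make the separation between monitoring and interpretation concrete, here is a
minimal sketch of a suspicion-level calculation. It is not the code in
HAMA-370.patch: the class and method names are hypothetical, and it assumes
heartbeat intervals are exponentially distributed, which gives
phi = elapsed / (mean * ln 10). That is a simplification of the normal
distribution estimate used in the original phi accrual paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a phi accrual monitor; not the class from the patch. */
class PhiAccrualSketch {
    // Sliding window of the most recent heartbeat inter-arrival times (ms).
    private final Deque<Long> intervals = new ArrayDeque<Long>();
    private final int windowSize;
    private long intervalSum = 0;
    private long lastHeartbeatMillis = -1;

    PhiAccrualSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Monitoring: record the arrival time of a heartbeat. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            long interval = nowMillis - lastHeartbeatMillis;
            intervals.addLast(interval);
            intervalSum += interval;
            if (intervals.size() > windowSize) {
                intervalSum -= intervals.removeFirst();
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /**
     * Suspicion level at the given time. Assumes exponentially distributed
     * intervals: P(next heartbeat arrives later than t) = exp(-t / mean),
     * so phi = -log10(exp(-t / mean)) = t / (mean * ln 10).
     */
    double phi(long nowMillis) {
        if (intervals.isEmpty() || intervalSum <= 0) {
            return 0.0; // not enough history to suspect anything yet
        }
        double mean = (double) intervalSum / intervals.size();
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10.0));
    }
}
```

Interpretation stays outside this class. One caller might compare phi against a
threshold of its own choosing to decide that a process has crashed, while a
master might simply sort workers by phi and hand urgent tasks to the least
suspected ones. Both read the same output value.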
Although a task failure can be solved with a restart, the difficulty lies in
distinguishing between a crashed/failed process and a very slow one. In
addition, if the project needs fault tolerance between bspmasters in the
future, a failure detection service is required.
> Failure detector for Hama
> -------------------------
>
> Key: HAMA-370
> URL: https://issues.apache.org/jira/browse/HAMA-370
> Project: Hama
> Issue Type: New Feature
> Components: bsp
> Affects Versions: 0.3.0
> Environment: GNU/ Debian, JDK 1.6.0_22-b04
> Reporter: ChiaHung Lin
> Assignee: ChiaHung Lin
> Labels: patch
> Fix For: 0.3.0
>
> Attachments: HAMA-370.patch, HAMA-370.patch
>
>
> To enable a fault tolerance service, the BSPMaster needs the ability to
> determine the GroomServers' status. This can generally be achieved through a
> failure detector. The attached file contains the source for such a patch.