[ 
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619174#action_12619174
 ] 

Ioannis Koltsidas commented on HADOOP-3585:
-------------------------------------------

Thanks for your comment, Otis. By "decoupled" I mean that FailMon is no longer 
started directly by a Hadoop component, as it was in the initial version (where 
it was started by NameNode.java and DataNode.java). However, since FailMon not 
only uses Hadoop but is also tailored for Hadoop log collection, we believe it 
makes sense for it to be part of the project: that will make it more visible to 
people running large clusters, most of whom use Hadoop.

In order to make it more visible (and more usable in the first place ;), we 
plan to set up a website/wiki for FailMon, where we will upload all info and 
documentation... 

> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf, 
> failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html, 
> HADOOP-3585.patch, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters 
> running Hadoop/HDFS. We are working on a framework that will enable nodes to 
> identify failures on their hardware using the Hadoop log, the system log and 
> various OS hardware diagnosing utilities. The implementation details are not 
> very clear, but you can see a draft of our design in the attached document. 
> We are pretty interested in Hadoop and system logs from failed machines, so 
> if you are in possession of such, you are very welcome to contribute them; 
> they would be of great value for hardware failure diagnosing.
> Some details about our design can be found in the attached document 
> failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
