[jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS

Steve Loughran (JIRA) Tue, 08 Jul 2008 09:03:22 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611671#action_12611671 ]


Steve Loughran commented on HADOOP-3585:
----------------------------------------

Some comments from a quick look at the code:

* Only IBM and Hitachi HDDs are logged? Is there any way to make it easily
  extensible to other SMART disks, since most SCSI HDDs have this facility?
* Why isn't the commons-logging API being used for logging messages and stack
  traces?
* What happens when you deploy on non-Unix systems?
* How do you propose to test all of this? I could imagine test data from
  various OS log files being used in unit tests, something to push out spoof
  files to test the live thread, and something else to parse the output via
  DFS, but I didn't see that in the patch.
* What is the lifecycle of the Monitor, the class called Executor? I see lots
  of shutdown code in various places.
* What happens if you try to start more than one monitor in the same process,
  or on the same machine?
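
To make the testing point concrete, here is a minimal sketch of the kind of
spoof-file unit test I have in mind. Everything in it is an assumption for
illustration: the class name SpoofLogSketch, the helper names, the "I/O error"
marker and the sample syslog lines are all made up and are not taken from the
patch. The real parsers would be plugged in where findDiskErrors() is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from the FailMon patch. */
public class SpoofLogSketch {

    /** Stand-in parser: returns every line that mentions an I/O error. */
    static List<String> findDiskErrors(Path logFile) throws IOException {
        List<String> hits = new ArrayList<>();
        for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
            if (line.contains("I/O error")) {
                hits.add(line);
            }
        }
        return hits;
    }

    /** Writes spoof log lines to a temp file, so no real hardware or OS log is needed. */
    static Path writeSpoofLog(List<String> lines) throws IOException {
        Path tmp = Files.createTempFile("spoof-syslog", ".log");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        tmp.toFile().deleteOnExit();
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample lines: one that looks like a disk failure, one that does not.
        Path log = writeSpoofLog(Arrays.asList(
            "Jul  8 09:00:01 node1 kernel: end_request: I/O error, dev sda, sector 12345",
            "Jul  8 09:00:02 node1 sshd[42]: session opened for user hadoop"));
        System.out.println(findDiskErrors(log).size() + " suspicious line(s) found");
    }
}
```

The same pattern should work for captured log files from different OSes: keep
them as test resources and assert on what each parser extracts from them.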

This is interesting, but I'd like to see it deployable as a standalone service
under the service code I'm putting together, rather than hidden under every
kind of Hadoop service that can be brought up. The polling also worries me.
Others may have different opinions.


> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, 
> FailMon_Package_descrip.html, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters 
> running Hadoop/HDFS. We are working on a framework that will enable nodes to 
> identify failures on their hardware using the Hadoop log, the system log and 
> various OS hardware diagnosing utilities. The implementation details are not 
> very clear, but you can see a draft of our design in the attached document. 
> We are pretty interested in Hadoop and system logs from failed machines, so 
> if you are in possession of such, you are very welcome to contribute them; 
> they would be of great value for hardware failure diagnosing.
> Some details about our design can be found in the attached document 
> failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
