[ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611726#action_12611726 ]

dhruba borthakur commented on HADOOP-3585:
------------------------------------------

Cool stuff!

1. It would be really nice to be able to deploy this without changing the 
namenode/datanode. One option would be to manually start your scheduler (which 
looks for the next report to be collected) and then run a map-reduce job to 
collect the statistics. Is this possible with your current code?

2. Regarding the format of the serialized logs, can we use an existing 
serialization format rather than inventing another one? One option would be to 
store the records as Java properties (name/value pairs) and serialize them using 
the standard Properties format. Another option would be to use Hadoop recordio 
(org.apache.hadoop.record.*).
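To illustrate the first option, here is a minimal sketch of round-tripping a
monitoring record through java.util.Properties; the property names are made up
for the example and are not the keys used by the patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class LogRecordDemo {
    public static void main(String[] args) throws IOException {
        // Collect metrics as name/value pairs (keys are illustrative only)
        Properties record = new Properties();
        record.setProperty("logcollector.hostname", "node42");
        record.setProperty("logcollector.disk.errors", "3");

        // Serialize using the standard Properties text format
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        record.store(out, "failure-monitoring record");

        // Deserialize and verify the round trip
        Properties loaded = new Properties();
        loaded.load(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(loaded.getProperty("logcollector.disk.errors"));
    }
}
```

The upside of this format is that the serialized records stay human-readable,
which matters when you are debugging the collector itself.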

3. Instead of calling it the failmon package, a better name could be 
logcollector or something more general. The logs could be used to detect 
failures, analyze the performance of specific machines, correlate events on one 
machine with those on another, etc. In the same vein, it might make sense to 
rename all configurable property names to the form "logcollector.nic.list", 
"logcollector.sensors.interval", etc.

4. What happens when the framework tries to upload a file into HDFS but a 
file with that name already exists in HDFS?
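One way to handle that collision would be to probe for an unused name before
uploading. A minimal sketch of the naming logic, with the existence check
abstracted as a predicate (in real code it would be HDFS's
FileSystem.exists(Path); the method and file names here are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

public class UploadNameDemo {
    // Return the first name in the sequence base, base.1, base.2, ...
    // for which the existence check fails.
    static String uniqueName(String base, Predicate<String> exists) {
        if (!exists.test(base)) {
            return base;
        }
        int i = 1;
        while (exists.test(base + "." + i)) {
            i++;
        }
        return base + "." + i;
    }

    public static void main(String[] args) {
        // Simulate files already present in HDFS with a set lookup
        Set<String> hdfs = new HashSet<>();
        hdfs.add("reports/node42.log");
        hdfs.add("reports/node42.log.1");
        System.out.println(uniqueName("reports/node42.log", hdfs::contains));
    }
}
```

Note that exists-then-create is racy if two nodes upload under the same name
concurrently, so an atomic create-and-fail-if-present would be safer.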



> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, 
> FailMon_Package_descrip.html, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters 
> running Hadoop/HDFS. We are working on a framework that will enable nodes to 
> identify failures on their hardware using the Hadoop log, the system log and 
> various OS hardware-diagnostic utilities. The implementation details are not 
> very clear, but you can see a draft of our design in the attached document. 
> We are pretty interested in Hadoop and system logs from failed machines, so 
> if you are in possession of such, you are very welcome to contribute them; 
> they would be of great value for hardware failure diagnosis.
> Some details about our design can be found in the attached document 
> failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
