Hardware Failure Monitoring in large clusters running Hadoop/HDFS
-----------------------------------------------------------------

                 Key: HADOOP-3585
                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
             Project: Hadoop Core
          Issue Type: New Feature
         Environment: Linux
            Reporter: Ioannis Koltsidas
            Priority: Minor


At IBM we are interested in identifying hardware failures on large clusters 
running Hadoop/HDFS. We are working on a framework that will enable nodes to 
identify failures in their own hardware using the Hadoop log, the system log, 
and various OS hardware diagnostic utilities. The implementation details are 
not yet settled, but a draft of our design is in the attached document. We are 
particularly interested in Hadoop and system logs from failed machines, so if 
you have any, you are very welcome to contribute them; they would be of great 
value for hardware failure diagnosis.
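
To make the log-based part of the idea concrete, here is a minimal sketch of 
how a node might scan log lines for signs of hardware trouble. It is only an 
illustration: the pattern table, the category names, and the functions 
classify_line and scan_log are assumptions made up for this example and are 
not taken from the attached design.

```python
import re

# Hypothetical pattern table: the categories and regular expressions below
# are illustrative assumptions, not part of the proposed framework.
HARDWARE_PATTERNS = {
    "disk": re.compile(r"I/O error|medium error", re.IGNORECASE),
    "memory": re.compile(r"machine check|ECC error", re.IGNORECASE),
    "network": re.compile(r"link is down", re.IGNORECASE),
}


def classify_line(line):
    """Return the first hardware category whose pattern matches, else None."""
    for category, pattern in HARDWARE_PATTERNS.items():
        if pattern.search(line):
            return category
    return None


def scan_log(lines):
    """Count suspicious lines per hardware category in an iterable of lines."""
    counts = {category: 0 for category in HARDWARE_PATTERNS}
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines for demonstration; not real log output.
    sample = [
        "sda: I/O error, sector 100",
        "eth0: link is down",
        "all good",
    ]
    print(scan_log(sample))
```

A real monitor would also need to read the output of OS diagnostic utilities 
and correlate events over time; plain keyword matching like this is only a 
starting point.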



Some details about our design can be found in the attached document 
failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
