[
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ioannis Koltsidas updated HADOOP-3585:
--------------------------------------
Attachment: HADOOP-3585.patch
Release of FailMon as a contrib project, with some additional features and many
bug fixes. Please refer to the user manual (failmon2.pdf) for a complete
description and instructions for deployment and execution of FailMon,
especially Section 4. The file FailMon_QuickStart.html provides a guide for
quickly setting up and running FailMon. Here is a summary of the changes we
have made since the previous patch:
- FailMon is now a contrib project and its code is decoupled from the Hadoop
core. Only the javadoc target in the hadoop-core-trunk/build.xml file has been
changed to account for the FailMon javadoc. Everything else lies under
src/contrib/failmon.
- Scheduling of monitoring jobs is now done in an ad-hoc fashion by one or more
"scheduler" nodes. Execution of FailMon is thus independent of Hadoop and can
be started and stopped arbitrarily. It can also be run with arbitrary user
permissions (it does not have to be run by the user that runs Hadoop on the
nodes), selectively on a subset of the nodes, and even at times when Hadoop is
not running.
- We have added a mechanism for concatenating all HDFS files created by FailMon
into a single HDFS file (to reduce metadata overhead at the namenode). A limit
on the maximum number of HDFS files created by FailMon can be set via the
configuration files.
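The concatenation mechanism above can be illustrated with a minimal sketch. FailMon does this on HDFS via the Hadoop FileSystem API; the sketch below shows the same idea on local files only, and the class and method names are hypothetical, not FailMon's actual code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch (local files, hypothetical names): append many small
// per-run output files onto a single merged file and delete the parts, so
// that only one file's worth of metadata remains. FailMon applies the same
// idea to the HDFS files it creates, reducing metadata overhead at the
// namenode.
public class Concatenator {
    // Append the contents of each part file to 'merged', then delete the part.
    static void concatenate(Path merged, List<Path> parts) throws IOException {
        try (OutputStream out = Files.newOutputStream(merged,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path part : parts) {
                Files.copy(part, out);
                Files.delete(part);
            }
        }
    }
}
```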
- We use the Commons Logging API to log messages and stack traces.
- The user can now specify entire directories with log files to be parsed. Note
that FailMon will now collect log files no matter how old they are, and upload
their entries into HDFS.
- We have made some bookkeeping information about the state of log file parsing
persistent locally on nodes. For each log file ever opened on a node, we store
its first log entry and the byte offset of the last entry parsed. The former
enables FailMon to detect log file rotation, while the latter is used to resume
parsing from the last entry parsed.
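The bookkeeping scheme above (first log entry to detect rotation, byte offset to resume parsing) can be sketched as follows. This is a minimal illustration in plain Java; the class and field names are hypothetical and not FailMon's actual code:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of FailMon-style log-parsing bookkeeping: for each
// log file we keep its first line (to detect rotation) and the byte offset
// just past the last entry parsed (to resume parsing on the next run).
public class LogParseState {
    String firstLine;  // first log entry seen when the file was last opened
    long lastOffset;   // byte offset just past the last parsed entry

    // Return any entries appended since the last run; start over on rotation.
    List<String> parseNew(File log) throws IOException {
        List<String> entries = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
            String head = raf.readLine();
            if (firstLine == null || !firstLine.equals(head)) {
                // A different first line means the file was rotated:
                // re-parse the new file from the beginning.
                firstLine = head;
                lastOffset = 0;
                raf.seek(0);
            } else {
                raf.seek(lastOffset);  // resume where we left off
            }
            String line;
            while ((line = raf.readLine()) != null) {
                entries.add(line);
            }
            lastOffset = raf.getFilePointer();
        }
        return entries;
    }
}
```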
- We have added an ant "tar" target, which packages FailMon in a jar file and
inserts it into an archive (together with all required libraries and
configuration files), so that it can be deployed and run independently of
Hadoop.
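Such a contrib packaging target typically looks like the following. This is a hypothetical sketch rather than the actual failmon build file; the target, property, and directory names are illustrative:

```xml
<!-- Hypothetical sketch of a contrib "tar" target: bundle the FailMon jar
     together with its libraries and configuration files into one archive,
     so it can be deployed and run independently of Hadoop. -->
<target name="tar" depends="jar">
  <tar destfile="${build.dir}/failmon.tar.gz" compression="gzip">
    <tarfileset dir="${build.dir}" includes="failmon.jar"/>
    <tarfileset dir="${lib.dir}" prefix="lib"/>
    <tarfileset dir="${conf.dir}" prefix="conf"/>
  </tar>
</target>
```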
> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
> Key: HADOOP-3585
> URL: https://issues.apache.org/jira/browse/HADOOP-3585
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Linux
> Reporter: Ioannis Koltsidas
> Priority: Minor
> Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf,
> FailMon_Package_descrip.html, HADOOP-3585.patch, HADOOP-3585.patch
>
> Original Estimate: 480h
> Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters
> running Hadoop/HDFS. We are working on a framework that will enable nodes to
> identify failures on their hardware using the Hadoop log, the system log and
> various OS hardware diagnostic utilities. The implementation details are not
> yet settled, but you can see a draft of our design in the attached document.
> We are very interested in Hadoop and system logs from failed machines, so
> if you have any such logs, you are very welcome to contribute them;
> they would be of great value for diagnosing hardware failures.
> Some details about our design can be found in the attached document
> failmon.doc. More details will follow in a later post.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.