[
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ioannis Koltsidas updated HADOOP-3585:
--------------------------------------
Attachment: HADOOP-3585.patch
Release of FailMon as a contrib project, with some additional features and many
bug fixes. Please refer to the user manual (failmon2.pdf) for a complete
description and instructions for deployment and execution of FailMon,
especially Section 4. The file FailMon_QuickStart.html provides a guide for
quickly setting up and running FailMon. Here is a summary of the changes we
have made since the previous patch:
- FailMon is now a contrib project and its code is decoupled from the Hadoop
core. Only the javadoc target in the hadoop-core-trunk/build.xml file has been
changed to account for the FailMon javadoc. Everything else lies under
src/contrib/failmon.
- Scheduling of monitoring jobs is now done in an ad-hoc fashion by one or more
"scheduler" nodes. Execution of FailMon is thus independent of Hadoop and can
be started and stopped arbitrarily. It can also be run with arbitrary user
permissions (it does not have to be run by the user that runs Hadoop on the
nodes), selectively on a subset of the nodes, and even at times when Hadoop is
not running.
- We have added a mechanism for concatenating all HDFS files created by FailMon
into a single HDFS file (to reduce metadata overhead at the namenode). A limit
on the maximum number of HDFS files created by FailMon can be set via the
configuration files.
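The concatenation mechanism above can be illustrated with a minimal sketch. FailMon does this on HDFS via the Hadoop FileSystem API; the sketch below shows the same idea on local files only, and the class and method names are hypothetical, not FailMon's actual code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch (local files, hypothetical names): append many small
// per-run output files onto a single merged file and delete the parts, so
// that only one file's worth of metadata remains. FailMon applies the same
// idea to the HDFS files it creates, reducing metadata overhead at the
// namenode.
public class Concatenator {
    // Append the contents of each part file to 'merged', then delete the part.
    static void concatenate(Path merged, List<Path> parts) throws IOException {
        try (OutputStream out = Files.newOutputStream(merged,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path part : parts) {
                Files.copy(part, out);
                Files.delete(part);
            }
        }
    }
}
```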
- We use the Commons Logging API to log messages and stack traces.
- The user can now specify entire directories with log files to be parsed. Note
that FailMon will now collect log files no matter how old they are, and upload
their entries into HDFS.
- We have made some bookkeeping information about the state of log file parsing
persistent locally on nodes. For each log file ever opened on a node, we store
its first log entry and the byte offset of the last entry parsed. The former
enables FailMon to detect log file rotation, while the latter is used to resume
parsing from the last entry parsed.
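The bookkeeping scheme above (first log entry to detect rotation, byte offset to resume parsing) can be sketched as follows. This is a minimal illustration in plain Java; the class and field names are hypothetical and not FailMon's actual code:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of FailMon-style log-parsing bookkeeping: for each
// log file we keep its first line (to detect rotation) and the byte offset
// just past the last entry parsed (to resume parsing on the next run).
public class LogParseState {
    String firstLine;  // first log entry seen when the file was last opened
    long lastOffset;   // byte offset just past the last parsed entry

    // Return any entries appended since the last run; start over on rotation.
    List<String> parseNew(File log) throws IOException {
        List<String> entries = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
            String head = raf.readLine();
            if (firstLine == null || !firstLine.equals(head)) {
                // A different first line means the file was rotated:
                // re-parse the new file from the beginning.
                firstLine = head;
                lastOffset = 0;
                raf.seek(0);
            } else {
                raf.seek(lastOffset);  // resume where we left off
            }
            String line;
            while ((line = raf.readLine()) != null) {
                entries.add(line);
            }
            lastOffset = raf.getFilePointer();
        }
        return entries;
    }
}
```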
- We have added an ant "tar" target, which packages FailMon in a jar file and
inserts it into an archive (together with all required libraries and
configuration files), so that it can be deployed and run independently of
Hadoop.
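Such a contrib packaging target typically looks like the following. This is a hypothetical sketch rather than the actual failmon build file; the target, property, and directory names are illustrative:

```xml
<!-- Hypothetical sketch of a contrib "tar" target: bundle the FailMon jar
     together with its libraries and configuration files into one archive,
     so it can be deployed and run independently of Hadoop. -->
<target name="tar" depends="jar">
  <tar destfile="${build.dir}/failmon.tar.gz" compression="gzip">
    <tarfileset dir="${build.dir}" includes="failmon.jar"/>
    <tarfileset dir="${lib.dir}" prefix="lib"/>
    <tarfileset dir="${conf.dir}" prefix="conf"/>
  </tar>
</target>
```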
> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
> Key: HADOOP-3585
> URL: https://issues.apache.org/jira/browse/HADOOP-3585
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Linux
> Reporter: Ioannis Koltsidas
> Priority: Minor
> Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf,
> FailMon_Package_descrip.html, HADOOP-3585.patch, HADOOP-3585.patch
>
> Original Estimate: 480h
> Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters
> running Hadoop/HDFS. We are working on a framework that will enable nodes to
> identify failures on their hardware using the Hadoop log, the system log and
> various OS hardware diagnostic utilities. The implementation details are not
> yet settled, but you can see a draft of our design in the attached document.
> We are very interested in Hadoop and system logs from failed machines, so
> if you have any such logs, you are very welcome to contribute them;
> they would be of great value for diagnosing hardware failures.
> Some details about our design can be found in the attached document
> failmon.doc. More details will follow in a later post.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.