[
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619943#action_12619943
]
Prasenjit Sarkar commented on HADOOP-3585:
------------------------------------------
Comment from Mac Yang:
Mac Yang <[EMAIL PROTECTED]> wrote on 08/05/2008 09:05:28 AM:
>
> Hi Prasenjit,
>
> I completely agree that we should check in both projects to facilitate
> getting feedback from a wider audience. And we will be happy to work
> together with you to make that happen.
>
> That said, as Jerome and Ariel have pointed out, there are several areas
> where it makes a lot of sense for FailMon and Chukwa to integrate /
> interoperate (data source, HDFS storage and M/R based analytics for
> example).
>
> While it shouldn't be a blocker for anything, I think it will be beneficial
> for everyone if we could figure out a way to align our resources and take
> advantage of the great synergy between FailMon and Chukwa.
>
> Thanks,
> Mac
>
>
>
> On 8/4/08 2:32 PM, "Dhruba Borthakur" <[EMAIL PROTECTED]> wrote:
>
> > Hi Prasenjit,
> >
> > All thanks to you and Ioannis for developing FailMon.
> >
> > It would be really nice if somebody from the Chukwa team can provide
> > feedback on the FailMon package, especially whether it *is* compatible
> > with Chukwa. It would be good to hear Mac's comments on whether these
> > two approaches solve the same problem or how they can be complementary
> > to one another.
> >
> > thanks
> > dhruba
> >
> > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi,
> >>
> >> As we discussed in our last meeting, we have uploaded the latest version of
> >> FailMon (and some documentation) to JIRA (HADOOP-3585). If you have some
> >> time to review it, we would be very interested to hear your comments and
> >> suggestions before it gets committed. Dhruba has agreed to commit the patch
> >> as soon as your team gives it a positive review. In the short term,
> >> however, we would like different people/companies to start deploying
> >> FailMon as soon as possible; to that end we need to commit it to the
> >> repository as soon as possible.
> >>
> >> We also believe that you should commit the Chukwa code and together we can
> >> get valuable feedback that can determine the direction of Chukwa and
> >> FailMon. In the interim, we await your support for the commit process for
> >> FailMon.
> >>
> >> Regards,
> >>
> >> Prasenjit Sarkar
> >> RSM and Manager, Storage Analytics and Resiliency
> >> Master Inventor
> >> IBM Almaden Storage Systems Research
> >>
> >>
>
> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
> Key: HADOOP-3585
> URL: https://issues.apache.org/jira/browse/HADOOP-3585
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Linux
> Reporter: Ioannis Koltsidas
> Priority: Minor
> Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf,
> failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html,
> HADOOP-3585.patch, HADOOP-3585.patch
>
> Original Estimate: 480h
> Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters
> running Hadoop/HDFS. We are working on a framework that will enable nodes to
> identify failures on their hardware using the Hadoop log, the system log and
> various OS hardware diagnostic utilities. The implementation details are not
> very clear, but you can see a draft of our design in the attached document.
> We are pretty interested in Hadoop and system logs from failed machines, so
> if you are in possession of such, you are very welcome to contribute them;
> they would be of great value for hardware failure diagnosis.
> Some details about our design can be found in the attached document
> failmon.doc. More details will follow in a later post.