[
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619934#action_12619934
]
Prasenjit Sarkar commented on HADOOP-3585:
------------------------------------------
Attached are a couple of email threads pertinent to this issue. In summary,
there is strong interest in committing both the FailMon and Chukwa projects
and then awaiting user feedback.
Ariel Rabkin <[EMAIL PROTECTED]> wrote on 08/04/2008 03:23:04 PM:
> As near as I could gather from the failmon code --
>
> Ideally, the failmon data collection plugins ("monitors") would be
> Chukwa adaptors. The abstractions are fairly close. Provided that
> failmon isn't going to be patching away too intensively in the next
> month, probably the best thing to do would be commit both, and merge later.
>
> --Ari
>
> ----- Original Message -----
> From: Dhruba Borthakur <[EMAIL PROTECTED]>
> Date: Monday, August 4, 2008 2:32 pm
> Subject: Re: support for FailMon commit...
> To: Prasenjit Sarkar <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
> Berkeley.EDU, [EMAIL PROTECTED], [EMAIL PROTECTED],
> [EMAIL PROTECTED], Ioannis Koltsidas <[EMAIL PROTECTED]>, Karan
> Gupta <[EMAIL PROTECTED]>
>
> > Hi Prasenjit,
> >
> > All thanks to you and Ioannis for developing FailMon.
> >
> > It would be really nice if somebody from the Chukwa team can provide
> > feedback on the FailMon package, especially whether it *is* compatible
> > with Chukwa. It would be good to hear Mac's comments on whether these
> > two approaches solve the same problem or how they can be complementary
> > to one another.
> >
> > thanks
> > dhruba
> >
> > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > As we discussed in our last meeting, we have uploaded the latest
> > > version of FailMon (and some documentation) to JIRA (HADOOP-3585).
> > > If you have some time to review it, we would be very interested to
> > > hear your comments and suggestions before it gets committed. Dhruba
> > > has agreed to commit the patch as soon as your team gives it a
> > > positive review. In the short term, however, we would like different
> > > people/companies to start deploying FailMon as soon as possible; to
> > > that end we need to commit it to the repository as soon as possible.
> > >
> > > We also believe that you should commit the Chukwa code and together
> > > we can get valuable feedback that can determine the direction of
> > > Chukwa and FailMon. In the interim, we await your support for the
> > > commit process for FailMon.
> > >
> > > Regards,
> > >
> > > Prasenjit Sarkar
> > > RSM and Manager, Storage Analytics and Resiliency
> > > Master Inventor
> > > IBM Almaden Storage Systems Research
> > >
> > >
and
Prasenjit Sarkar/Almaden/IBM wrote on 08/04/2008 03:19:45 PM:
> Jerome,
>
> I appreciate your analysis of the integration scenarios. Taking a
> step back, we think that both Chukwa and FailMon provide interesting
> value propositions independent of each other. For example, we have
> had requests from a few groups wanting to use FailMon independently
> as a quick cluster health post-processor. I'm sure that Chukwa has a
> similar user community. In that vein, I would not like the value
> proposition of these two complementary projects to be diluted by the
> integration discussion.
>
> So, I would vote for a quick committal for both projects followed by
> integration discussions moderated by Hadoop committers using feedback
> from Chukwa/FailMon users.
>
> I hope this is reasonable,
>
> Regards,
>
> Prasenjit Sarkar
> RSM and Manager, Storage Analytics and Resiliency
> Master Inventor
> IBM Almaden Storage Systems Research
>
> Jerome Boulon <[EMAIL PROTECTED]>
> 08/04/2008 10:11 AM
>
> To
>
> Prasenjit Sarkar <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>,
> Ioannis Koltsidas/Almaden/[EMAIL PROTECTED], Karan Gupta/Almaden/[EMAIL PROTECTED],
> <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, Runping Qi
> <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, Mac Yang
> <[EMAIL PROTECTED]>
>
> cc
>
> Subject
>
> FailMon - Chukwa integration
>
> Hi,
> I have taken a look at FailMon, and here is how we can integrate it with Chukwa.
> Basically, there are 3 entry points in Chukwa:
>
> 1- At the adaptor level (inject data)
> 2- At the Demux level (Data analysis)
> 3- Using the archive.
>
> 1- Running FailMon at the adaptor level would prevent anyone from using the
> real data. So this should not be used in the general case.
>
> 2- It's possible to run FailMon as a Demux processor and output exactly what
> we want, and that would have been my suggestion. But FailMon is not intended
> to be used directly by the company that produces the output (at least for
> now), so I would prefer not to use FailMon there: we're planning to run
> critical processors, and adding any latency here may become an issue.
>
> 3- So my recommendation is to use all of Chukwa's archives as input for
> FailMon. The main advantage is that all the data is grouped together in one or
> more big Sequence files that can be easily processed using M/R, and since
> this is offline post-processing, the impact on the production cluster can
> be easily controlled.
>
> /Jerome.
>
> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
> Key: HADOOP-3585
> URL: https://issues.apache.org/jira/browse/HADOOP-3585
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Linux
> Reporter: Ioannis Koltsidas
> Priority: Minor
> Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf,
> failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html,
> HADOOP-3585.patch, HADOOP-3585.patch
>
> Original Estimate: 480h
> Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters
> running Hadoop/HDFS. We are working on a framework that will enable nodes to
> identify failures on their hardware using the Hadoop log, the system log and
> various OS hardware diagnostic utilities. The implementation details are not
> very clear, but you can see a draft of our design in the attached document.
> We are pretty interested in Hadoop and system logs from failed machines, so
> if you are in possession of such logs, you are very welcome to contribute them;
> they would be of great value for hardware failure diagnosis.
> Some details about our design can be found in the attached document
> failmon.doc. More details will follow in a later post.