[jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS

Prasenjit Sarkar (JIRA) Tue, 05 Aug 2008 09:08:38 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619934#action_12619934
 ]


Prasenjit Sarkar commented on HADOOP-3585:
------------------------------------------

Attached are a couple of threads of email conversation pertinent to this issue. 
In summary, there is strong interest in committing both the FailMon and Chukwa 
projects and awaiting user feedback.

Ariel Rabkin <[EMAIL PROTECTED]> wrote on 08/04/2008 03:23:04 PM:

> As near as I could gather from the failmon code --
> 
> Ideally, the failmon data collection plugins ("monitors") would be 
> Chukwa adaptors.  The abstractions are fairly close.  Provided that 
> failmon isn't going to be patching away too intensively in the next 
> month, probably the best thing to do would be commit both, and merge later.  
> 
> --Ari
> 
> ----- Original Message -----
> From: Dhruba Borthakur <[EMAIL PROTECTED]>
> Date: Monday, August 4, 2008 2:32 pm
> Subject: Re: support for FailMon commit...
> To: Prasenjit Sarkar <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
> Berkeley.EDU, [EMAIL PROTECTED], [EMAIL PROTECTED], 
> [EMAIL PROTECTED], Ioannis Koltsidas <[EMAIL PROTECTED]>, Karan
> Gupta <[EMAIL PROTECTED]>
> 
> > Hi Prasenjit,
> > 
> > All thanks to you and Ioannis for developing FailMon.
> > 
> > It would be really nice if somebody from the Chukwa team can provide
> > feedback on the FailMon package, especially whether it *is* compatible
> > with Chukwa. It would be  good to hear Mac's comments on whether these
> > two approaches solve the same problem or how they can be complementary
> > to one another.
> > 
> > thanks
> > dhruba
> > 
> > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > As we discussed in our last meeting, we have uploaded the latest version of
> > > FailMon (and some documentation) to JIRA (HADOOP-3585). If you have some
> > > time to review it, we would be very interested to hear your comments and
> > > suggestions before it gets committed. Dhruba has agreed to commit the patch
> > > as soon as your team gives it a positive review.  In the short term,
> > > however, we would like different people/companies to start deploying
> > > FailMon as soon as possible; to that end we need to commit it to the
> > > repository as soon as possible.
> > >
> > > We also believe that you should commit the Chukwa code and together we can
> > > get valuable feedback that can determine the direction of Chukwa and
> > > FailMon. In the interim, we await your support for the commit process for
> > > FailMon.
> > >
> > > Regards,
> > >
> > > Prasenjit Sarkar
> > > RSM and Manager, Storage Analytics and Resiliency
> > > Master Inventor
> > > IBM Almaden Storage Systems Research
> > >
> > >

and

Prasenjit Sarkar/Almaden/IBM wrote on 08/04/2008 03:19:45 PM:

> Jerome,
> 
> I appreciate your analysis of the integration scenarios. Taking a 
> step back, we think that both Chukwa and FailMon provide interesting
> value propositions independent of each other. For example, we have 
> had requests from a few groups wanting to use FailMon independently 
> as a quick cluster health post-processor. I'm sure that Chukwa has a
> similar user community. In that vein, I would not like the value 
> proposition of these two complementary projects to be diluted by the 
> integration discussion.
> 
> So, I would vote for a quick committal for both projects followed by 
> integration discussions moderated by Hadoop committers using feedback
> from Chukwa/FailMon users.
> 
> I hope this is reasonable,
> 
> Regards,
> 
> Prasenjit Sarkar
> RSM and Manager, Storage Analytics and Resiliency
> Master Inventor
> IBM Almaden Storage Systems Research
> 
> Jerome Boulon <[EMAIL PROTECTED]> 
> 08/04/2008 10:11 AM
> 
> To
> 
> Prasenjit Sarkar <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, 
> Ioannis Koltsidas/Almaden/[EMAIL PROTECTED], 
> Karan Gupta/Almaden/[EMAIL PROTECTED], 
> <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, Runping Qi 
> <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, Mac Yang 
> <[EMAIL PROTECTED]>
> 
> cc
> 
> Subject
> 
> FailMon - Chukwa integration
> 
> Hi,
> I have taken a look at FailMon, and here is how we can integrate it with Chukwa.
> Basically there are 3 entry points in Chukwa:
> 
> 1- At the adaptor level (inject data)
> 2- At the Demux level (Data analysis)
> 3- Using the archive.
> 
> 1- Running FailMon at the adaptor level will prevent anyone from using the real
> data. So this should not be used in the general case.
> 
> 2- It's possible to run FailMon as a Demux processor and output exactly what
> we want, and that would have been my suggestion. But FailMon is not intended
> to be used directly by the company that produces the output (at least for
> now), so I would prefer not to use FailMon there, since we're planning to run
> critical processors and adding any latency here may become an issue.
> 
> 3- So my recommendation is to use all of Chukwa's archives as input for
> FailMon. The main advantage is that all the data is grouped together in one or
> more big Sequence files that can be easily processed using M/R, and since
> it's offline post-processing, the impact on the production cluster can
> be easily controlled.
> 
> /Jerome.
> 


> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf, 
> failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html, 
> HADOOP-3585.patch, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters 
> running Hadoop/HDFS. We are working on a framework that will enable nodes to 
> identify failures on their hardware using the Hadoop log, the system log and 
> various OS hardware diagnosing utilities. The implementation details are not 
> very clear, but you can see a draft of our design in the attached document. 
> We are pretty interested in Hadoop and system logs from failed machines, so 
> if you are in possession of such, you are very welcome to contribute them; 
> they would be of great value for hardware failure diagnosing.
> Some details about our design can be found in the attached document 
> failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
