[jira] Commented: (HADOOP-3585) Hardware Failure Monitoring in large clusters running Hadoop/HDFS

Prasenjit Sarkar (JIRA) Tue, 05 Aug 2008 09:17:43 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619943#action_12619943 ]


Prasenjit Sarkar commented on HADOOP-3585:
------------------------------------------

Comment from Mac Yang:

Mac Yang <[EMAIL PROTECTED]> wrote on 08/05/2008 09:05:28 AM:

> 
> Hi Prasenjit,
> 
> I completely agree that we should check in both projects to facilitate
> getting feedback from a wider audience. And we will be happy to work
> together with you to make that happen.
> 
> That said, as Jerome and Ariel have pointed out, there are several areas
> where it makes a lot of sense for FailMon and Chukwa to integrate /
> interoperate (data source, HDFS storage and M/R based analytics for
> example).
> 
> While it shouldn't be a blocker for anything, I think it will be beneficial
> for everyone if we could figure out a way to align our resources and take
> advantage of the great synergy between FailMon and Chukwa.
> 
> Thanks,
> Mac
>  
> 
> 
> On 8/4/08 2:32 PM, "Dhruba Borthakur" <[EMAIL PROTECTED]> wrote:
> 
> > Hi Prasenjit,
> > 
> > All thanks to you and Ioannis for developing FailMon.
> > 
> > It would be really nice if somebody from the Chukwa team can provide
> > feedback on the FailMon package, especially whether it *is* compatible
> > with Chukwa. It would be good to hear Mac's comments on whether these
> > two approaches solve the same problem or how they can be complementary
> > to one another.
> > 
> > thanks
> > dhruba
> > 
> > On Fri, Aug 1, 2008 at 4:10 PM, Prasenjit Sarkar
> > <[EMAIL PROTECTED]> wrote:
> >> 
> >> Hi,
> >> 
> >> As we discussed in our last meeting, we have uploaded the latest version of
> >> FailMon (and some documentation) to JIRA (HADOOP-3585). If you have some
> >> time to review it, we would be very interested to hear your comments and
> >> suggestions before it gets committed. Dhruba has agreed to commit the patch
> >> as soon as your team gives it a positive review.  In the short term,
> >> however, we would like different people/companies to start deploying
> >> FailMon as soon as possible; to that end we need to commit it to the
> >> repository as soon as possible.
> >> 
> >> We also believe that you should commit the Chukwa code and together we can
> >> get valuable feedback that can determine the direction of Chukwa and
> >> FailMon. In the interim, we await your support for the commit process for
> >> FailMon.
> >> 
> >> Regards,
> >> 
> >> Prasenjit Sarkar
> >> RSM and Manager, Storage Analytics and Resiliency
> >> Master Inventor
> >> IBM Almaden Storage Systems Research
> >> 
> >> 
> 


> Hardware Failure Monitoring in large clusters running Hadoop/HDFS
> -----------------------------------------------------------------
>
>                 Key: HADOOP-3585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3585
>             Project: Hadoop Core
>          Issue Type: New Feature
>         Environment: Linux
>            Reporter: Ioannis Koltsidas
>            Priority: Minor
>         Attachments: FailMon-standalone.zip, failmon.pdf, failmon.pdf, 
> failmon2.pdf, FailMon_Package_descrip.html, FailMon_QuickStart.html, 
> HADOOP-3585.patch, HADOOP-3585.patch
>
>   Original Estimate: 480h
>  Remaining Estimate: 480h
>
> At IBM we're interested in identifying hardware failures on large clusters 
> running Hadoop/HDFS. We are working on a framework that will enable nodes to 
> identify failures on their hardware using the Hadoop log, the system log and 
> various OS hardware diagnosing utilities. The implementation details are not 
> very clear, but you can see a draft of our design in the attached document. 
> We are pretty interested in Hadoop and system logs from failed machines, so 
> if you are in possession of such, you are very welcome to contribute them; 
> they would be of great value for hardware failure diagnosing.
> Some details about our design can be found in the attached document 
> failmon.doc. More details will follow in a later post.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
