Re: Feature Proposal: We already have something coded for our research purposes and would like to contribute.

Lev Fri, 19 Jul 2013 04:58:27 -0700

I threw together a couple of pages on our research group's wiki site with
some extra information. We didn't really think we'd garner interest (and we
originally planned to use it as a project-internal tool/version), so the
wiki is somewhat basic, but I am happy to answer any questions people might
have.


Currently we have an implementation for the (rather old) Cloudera CDH3
distribution of Hadoop. We have previously ported it between CDH3 and
0.21 (which is also old by now) without difficulty, so porting it to the
current trunk should not be hard. We would, of course, have to write unit
tests that meet the project's standards.
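
For a rough idea of what "transfer level" means compared to per-task
counters (see the original proposal quoted below), here is a minimal
sketch. It is illustrative only and not our actual code: the record
layout, field names and helper functions are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout; the real report format may differ.
@dataclass(frozen=True)
class TransferReport:
    task_id: str
    kind: str       # e.g. "MAPPER_INPUT", "REDUCER_INPUT" or "HDFS_WRITE"
    src_host: str
    dst_host: str
    num_bytes: int


def bytes_per_task(reports):
    """Roughly what a per-task counter shows: one byte total per task."""
    totals = defaultdict(int)
    for r in reports:
        totals[r.task_id] += r.num_bytes
    return dict(totals)


def bytes_per_link(reports):
    """What per-transfer reports also allow: totals per (source, destination)."""
    totals = defaultdict(int)
    for r in reports:
        totals[(r.src_host, r.dst_host)] += r.num_bytes
    return dict(totals)


# Made-up example data: one reducer pulling input from two hosts,
# and one mapper reading its input from a remote host.
reports = [
    TransferReport("reduce_0", "REDUCER_INPUT", "node1", "node3", 400),
    TransferReport("reduce_0", "REDUCER_INPUT", "node2", "node3", 100),
    TransferReport("map_1", "MAPPER_INPUT", "node2", "node1", 250),
]

print(bytes_per_task(reports))
print(bytes_per_link(reports))
```

The per-task view collapses both reducer transfers into a single number,
while the per-transfer view keeps track of which pair of hosts carried the
traffic, which is the part that matters when looking at network speed.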

Here is some more info:
http://www.cs.huji.ac.il/wikis/MediaWiki/lawa/index.php/Hadoop_Kelvin

Thanks!

On Fri, Jul 19, 2013 at 12:26 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> This looks interesting.
> Do you have a blog or wiki page where I can read about your approach?
>
> This will surely be useful for planning network capacity, at least for me.
>
>
> On Fri, Jul 19, 2013 at 2:45 PM, Lev <l...@vrsob.com> wrote:
>
> > Hi!
> >
> > My colleague and I have implemented a logging system that collects reports
> > about Hadoop network traffic in a centralized "Statistic Server". We
> > collect information about Mapper Inputs, Reducer Inputs and HDFS Writes at
> > the transfer level, rather than the total number of bytes per task (which
> > is what counters do currently). We originally aimed this at building a
> > system that could keep track of network performance in the cluster in
> > real time so that scheduling adjustments can be made on the fly (hence the
> > centralized "Statistic Server"; the system can also easily be configured
> > to log the reports locally on each machine by adjusting the XML
> > configuration files). We eventually used this system to investigate the
> > effects of network speed on job running time, particularly in the context
> > of clusters deployed across the Internet.
> >
> > We would like to gauge interest in the Hadoop community in this feature, as
> > we would like to contribute it to the project. It is mostly aimed at
> > research users (those who use Hadoop as a research platform, and also those
> > who research the workings and performance of Hadoop itself; we are in the
> > second category ourselves), although it might also be used by people who
> > wish to analyze the data flow of the various stages of Hadoop computation
> > in their jobs. In turn, this should enable a new way to discover possible
> > optimizations for jobs.
> >
> > This has no effect on Hadoop when disabled, which it will be by default.
> >
> > If there is any interest, please let us know what we should elaborate on
> > further.
> >
> > Thanks,
> > Lev Faerman and Aviad Pines.
> >
>
>
>
> --
> Nitin Pawar
>
