Hi!

My colleague and I have implemented a logging system that collects reports
about Hadoop network traffic in a centralized "Statistic Server". We
collect information about Mapper Inputs, Reducer Inputs and HDFS Writes at
the level of individual transfers, rather than the total number of bytes
per task (which is what the counters currently report). Our original goal
was a system that could track network performance in the cluster in real
time, so that scheduling adjustments could be made on the fly. That is why
we created a centralized "Statistic Server", but the system can also
easily log the reports locally on each machine by adjusting the XML
configuration files. We eventually used this system to investigate the
effects of network speed on job running time, particularly in the context
of clusters deployed across the Internet.

We would like to gauge the Hadoop community's interest in this feature, as
we would like to contribute it to the project. It is aimed mostly at
research users: those who use Hadoop as a research platform, and those who
research the workings and performance of Hadoop itself (we are in the
second category ourselves). It might also be used by people who wish to
analyze the data flow of the various stages of Hadoop computation in their
jobs, which in turn should offer a new way to discover possible
optimizations for those jobs.

The feature will be disabled by default, and has no effect on Hadoop when
disabled.

If there is interest, please let us know what we should elaborate on.

Thanks,
Lev Faerman and Aviad Pines.
