This looks interesting. Do you have a blog or wiki page where I can read about your approach?
This will surely be useful for planning network capacity, at least for me.

On Fri, Jul 19, 2013 at 2:45 PM, Lev <l...@vrsob.com> wrote:

> Hi!
>
> My colleague and I have implemented a logging system that collects reports
> about Hadoop network traffic in a centralized "Statistic Server". We
> collect information about Mapper Inputs, Reducer Inputs and HDFS Writes at
> the transfer level, rather than the total number of bytes per task (which
> is what counters do currently). We originally aimed this at building a
> system which would be able to keep track of network performance in the
> cluster in real-time so that scheduling adjustments can be made on the fly
> (hence a centralized "Statistic Server" was created, but the system can
> also be easily used to log them locally on each machine by adjusting the
> XML configuration files). We eventually used this system for investigating
> the effects of network speed on job running time, particularly in the
> context of clusters deployed across the Internet.
>
> We would like to gauge interest in the Hadoop community in this feature, as
> we would like to contribute this to the project. It is, mostly, aimed at
> research users (those who use Hadoop as a research platform, and also those
> who research the workings and performance of Hadoop itself - We are of the
> second category ourselves), although it might also be used by people who
> wish to analyze the data flow of the various stages of Hadoop computation
> in their jobs. In turn, this should enable a new way to discover possible
> optimizations for jobs.
>
> This has no effect on Hadoop when disabled, which, by default, it will be.
>
> Please let us know what/if we should elaborate further, if any interest
> exists.
>
> Thanks,
> Lev Faerman and Aviad Pines.
>

--
Nitin Pawar