Hi fellow Hadoop users and developers,

I am a third-year PhD student at the University of Illinois, working on improving workload management and scheduling in Hadoop. I have tested some of my ideas on synthetic workloads, GridMix, hadoop-examples, and a few of my own applications. I am now looking for real workloads that are run in industry. Specifically, I am interested in the job logs (stored by default on the JobTracker) of real workloads.

If people are concerned about the confidentiality of their applications, I would like to mention that these logs contain very little information about the processed data or the application itself. Anonymizing the job names (and their submission times, etc.) would not be too much of a problem.
I would love to collaborate with folks from industry in understanding these workloads. I sincerely hope that the research I am conducting will benefit everybody.

Thanks a lot.
-Abhishek
