Hi Bernd,
I've been working on data placement in Hadoop and wanted to test how
the changes I've made affect cluster performance. This project
follows research on energy efficiency in Hadoop clusters, such as
"Energy efficiency for large-scale MapReduce workloads with significant
interactive analysis" (http://dl.acm.org/citation.cfm?id=2168842)
and "Energy Management for MapReduce Clusters"
(http://vldb.org/pvldb/vldb2010/papers/R11.pdf).
I think it is best to test changes made to the Hadoop framework using
workloads that mimic real-world environments. To my understanding, the
job history logs written and maintained by the Hadoop framework should
contain enough information for someone to recreate the original
workload. Tools like Apache Rumen and GridMix are useful for recreating
workloads, but need job history logs as input.
The authors of the first paper I mention are from Facebook, Cloudera,
and Berkeley, so they were able to recreate workloads from old Facebook
and Cloudera logs. However, there is no public dataset that mimics the
workload of production-level clusters. At least I can't find one.
It would be good if there were one, so that researchers or developers
anywhere could see how the changes they are working on affect the
real-world performance of the cluster, without resorting to complicated
simulations.
Hope this clears things up a bit more.
Regards,
Arjun
On 08/24/2014 12:50 PM, Bernd wrote:
Hello Arjun
You might want to mention what project you are talking about. What
kind of jobs and framework? Do you mean any kind of logs? In which format?
Greetings
Bernd
On 23.08.2014 19:19, "Arjun Bakshi"
<bakshi...@buckeyemail.osu.edu> wrote:
Hi,
Would it be possible for people to contribute towards a dataset of
job history logs that model different real-world/production
environments for research purposes? It would make it easier to run
simulations or test changes to the framework for people who don't
have access to large company clusters. User- or data-specific
details such as user names and file names can be anonymized.
If this isn't the list to ask for this, please let me know what
would be a good place for this request.
Thank you,
Arjun