Hi Bernd,

I've been working on data placement in Hadoop and wanted to test how my changes affect cluster performance. This project follows research on energy efficiency in Hadoop clusters, in papers like

"Energy efficiency for large-scale MapReduce workloads with significant interactive analysis", at http://dl.acm.org/citation.cfm?id=2168842

and "Energy Management for MapReduce Clusters" at http://vldb.org/pvldb/vldb2010/papers/R11.pdf

I think it is best to test changes made to the Hadoop framework using workloads that mimic real-world environments. To my understanding, the job history logs written and maintained by the Hadoop framework should contain enough information for someone to recreate the original workload. Tools like Apache Rumen and GridMix are useful for recreating workloads, but they need job history logs as input.

The authors of the first paper I mention are from Facebook, Cloudera, and Berkeley, so they were able to recreate realistic workloads from old Facebook and Cloudera logs. However, there is no public dataset that mimics the workload of production-level clusters, or at least I can't find one.

It would be good if there were one, so that researchers or developers anywhere could see how the changes they are working on affect the real-world performance of a cluster, without resorting to complicated simulations.

Hope this clears things up a bit more.

Regards,

Arjun


On 08/24/2014 12:50 PM, Bernd wrote:

Hello Arjun

You might want to mention which project you are talking about. What kind of jobs and framework? Do you mean any kind of logs? In which format?

Greetings
Bernd

On 23.08.2014 19:19, "Arjun Bakshi" <bakshi...@buckeyemail.osu.edu> wrote:

    Hi,

    Would it be possible for people to contribute towards a dataset of
    job history logs that model different real-world/production
    environments for research purposes? It would make it easier to run
    simulations or test changes to the framework for people who don't
    have access to large company clusters. User- or data-specific
    details like user names, file names, etc. can be anonymized.

    If this isn't the list to ask for this, please let me know what
    would be a good place for this request.

    Thank you,

    Arjun

