Re: Anonymized job history logs dataset for R&D

arjun Sun, 24 Aug 2014 13:49:11 -0700

Hi Bernd,

I've been working on data placement in Hadoop and wanted to test how the changes I've made affect cluster performance. This project follows research in the area of energy efficiency in Hadoop clusters, in papers like

"Energy efficiency for large-scale MapReduce workloads with significant interactive analysis", at http://dl.acm.org/citation.cfm?id=2168842

and "Energy Management for MapReduce Clusters" at http://vldb.org/pvldb/vldb2010/papers/R11.pdf

I think it is best to test changes made to the Hadoop framework using workloads that mimic real-world environments. To my understanding, the job history logs written and maintained by the Hadoop framework should contain enough information for someone to recreate the original workload. Tools like Apache Rumen and GridMix are useful for recreating workloads, but need job history logs as input.

The authors of the first paper I mention are from Facebook, Cloudera, and Berkeley, so they were able to do so by taking old logs from FB and Cloudera. However, there is no public dataset that mimics the workload of production-level clusters. At least I can't find one.

It would be good if there were one, so that researchers or developers from anywhere could see how the changes they are working on affect the real-world performance of the cluster, without resorting to complicated simulations.


Hope this clears things up a bit more.

Regards,

Arjun


On 08/24/2014 12:50 PM, Bernd wrote:


Hello Arjun

You might want to mention what project you are talking about? What kind of jobs and framework? Do you mean any kind of logs? In which format?


Greetings
Bernd

On 23.08.2014 at 19:19, "Arjun Bakshi" <bakshi...@buckeyemail.osu.edu> wrote:


    Hi,

    Would it be possible for people to contribute towards a dataset of
    job history logs that model different real-world/production
    environments for research purposes? It would make it easier to run
    simulations or test changes to the framework for people who don't
    have access to large company clusters. User- or data-specific
    details like user names, file names, etc. can be anonymized.

    If this isn't the list to ask for this, please let me know what
    would be a good place for this request.

    Thank you,

    Arjun
