Hi all, I'm working on a machine learning project with Hadoop, in which I use the Mahout matrix package.
My application is effectively an instance of Gibbs sampling, and therefore iterative. One iteration consists of 9 MR jobs, with some dependencies between them and some code executed in between. I need to repeat this thousands of times. At present an iteration takes around 7 minutes (I run everything on Amazon Elastic MapReduce), which makes thousands of iterations prohibitive. The bulk of the time is eaten up by Hadoop overhead, not by actual number-crunching. Of the 9 MR jobs comprising an iteration, 6 map over my training data, which consists of around 50,000 Wall Street Journal sentences, i.e. a rather small dataset. Two jobs map over the cells of a 200*200 matrix, and the last one over 50,000 words.

I'm aware that, generally speaking, Hadoop may not be well suited to this sort of job because of the overhead involved. What better options come to mind for distributing an iterative computation (with "mainstream", possibly open-source, software)?

Here are the optimizations I am thinking of in my present setting:

- run some jobs in parallel, using JobControl. One issue I have here is that I don't know how to have the code in between MR jobs executed by the JobControl. I don't want to create fake MR jobs, because that would only add overhead. Is there a solution you know of? (A sketch of the workaround I'm considering is in the P.S. below.)
- avoid constructing Writables every time I need to collect one (tip 6 from http://www.cloudera.com/blog/2009/12/17/7-tips-for-improving-mapreduce-performance/ ; also sketched in the P.S.)
- using the Mahout matrix package, I need to change the cardinality of matrices and vectors (e.g. by appending an element/row, or removing one) -- I'd like to do this in place instead of creating a new 200*200 matrix each time. Is there some way to do that? (See the P.S. for the view-based workaround I'm trying.)
- run the garbage collector often, to avoid running low on memory, which prevents the Hadoop shuffles from running in memory and makes them spill to disk
- use map-only jobs where possible, by setting conf.setNumReduceTasks(0) rather than relying on the default IdentityReducer, which still incurs the shuffle (my sentence data doesn't need merging/sorting; snippet in the P.S.)
- JVM reuse is not available in Hadoop 0.18.3, which is what Amazon Elastic MapReduce runs
- write my own combiner to speed up the shuffle phase (not sure how much I can gain)

In Mahout, the Dirichlet and LDA packages follow this iterative pattern, albeit with only one MR job per iteration, not 9. Can you give me some advice from your experience with such iterative MR jobs? Any places in the Mahout code I should read? Any optimization I'm not thinking of, or tuning I should consider or check for?

If someone would like to look at my code, I would be happy to share it.

Thanks!
Sebastien
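P.S. To make the JobControl question concrete, here is a minimal sketch of the workaround I'm considering: one JobControl per "stage" of jobs that can run in parallel, with the in-between code executed in the driver once the stage finishes. The buildJobConf*() and updateHyperparameters() methods are placeholders for my real job configurations and in-between code. I'd still prefer a way to hook that code into a single JobControl, if one exists.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class GibbsIterationDriver {

      static void runStage(String name, JobConf... confs) throws Exception {
        JobControl jc = new JobControl(name);
        for (JobConf conf : confs) {
          jc.addJob(new Job(conf)); // Job.addDependingJob() can still express
        }                           // dependencies within a stage
        Thread runner = new Thread(jc); // JobControl implements Runnable
        runner.start();
        while (!jc.allFinished()) {
          Thread.sleep(1000);           // poll until every job in the stage is done
        }
        jc.stop();
      }

      public static void main(String[] args) throws Exception {
        for (int iter = 0; iter < NUM_ITERATIONS; iter++) {
          // jobs with no mutual dependencies run in parallel within a stage
          runStage("stage-1", buildJobConfA(), buildJobConfB());
          updateHyperparameters(); // the "in-between" code, run in the driver
          runStage("stage-2", buildJobConfC());
        }
      }

      // placeholders for the real job configurations and in-between code
      static JobConf buildJobConfA() { return new JobConf(); }
      static JobConf buildJobConfB() { return new JobConf(); }
      static JobConf buildJobConfC() { return new JobConf(); }
      static void updateHyperparameters() { }
      static final int NUM_ITERATIONS = 1000;
    }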
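P.S. On tip 6 (reusing Writables): this is the pattern I mean, sketched against the old (0.18) API with a word-count-style mapper. collect() serializes the key/value immediately, so reusing the instances across calls is safe.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // allocated once per task, not once per map() call
      private final Text word = new Text();
      private final IntWritable one = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
          word.set(token);           // reuse the same Text instance
          output.collect(word, one);
        }
      }
    }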
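P.S. On the matrix cardinality point, the workaround I'm experimenting with: allocate the 200*200 matrix once at its maximum size and work on a view of the live sub-block. This assumes viewPart() returns a view sharing storage with the original rather than a copy, which is my reading of the Mahout matrix package (org.apache.mahout.matrix in the version I'm on; the package may be named differently in other releases).

    import org.apache.mahout.matrix.DenseMatrix;
    import org.apache.mahout.matrix.Matrix;

    Matrix full = new DenseMatrix(200, 200);  // allocated once, at maximum size
    int liveRows = 180;
    int liveCols = 200;
    // view of the top-left liveRows x liveCols block, backed by 'full':
    // "removing" a row is then just shrinking liveRows, with no reallocation
    Matrix active = full.viewPart(new int[] {0, 0}, new int[] {liveRows, liveCols});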
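P.S. For the map-only jobs and the combiner, these are the driver-side settings I have in mind (SumReducer is a placeholder for a reducer whose function is associative and commutative, e.g. summing counts, which is what makes it safe to use as a combiner):

    JobConf conf = new JobConf(GibbsIterationDriver.class);

    // map-only job: no shuffle or sort at all, mappers write straight to HDFS
    conf.setNumReduceTasks(0);

    // for the jobs that do reduce, a combiner cuts shuffle volume:
    // conf.setCombinerClass(SumReducer.class);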
