Gibbs sampling can be quite difficult in a map-reduce setting because the state is inherently chained from one iteration to the next.
Parallel processing can help if you have a relatively small number of
iterations but each iteration is expensive (LDA, Dirichlet process
clustering), or if you want to use many chains. If your process mixes
poorly, then you can be in a world of hurt. We have been lucky so far to
have very fast-mixing processes (Dirichlet process) or variational
techniques that converge very quickly (LDA).

If you want many iterations, then Hadoop-based map-reduce may be very
difficult to make efficient. The minimum job time is probably going to be
tens of seconds. Running thousands of these is going to hurt. Running tens
of thousands will hurt much worse.

Some sketches for the specific points you raise follow below the quoted
message.

On Tue, Jan 26, 2010 at 4:12 PM, Sebastien Bratieres <[email protected]> wrote:

> Hi all,
>
> I'm working on a machine learning project with Hadoop, in which I use the
> Mahout matrix package.
>
> My application effectively is an instance of Gibbs sampling, and is
> therefore iterative. One iteration consists of 9 MR jobs, with some
> dependencies between them and some code executed between them. I need to
> iterate over this thousands of times. At present an iteration takes
> around 7 minutes (I run everything on Amazon Elastic MapReduce), which
> makes it prohibitive for thousands of iterations. The bulk of the time is
> eaten up by Hadoop overhead, not by proper number-crunching. Of the 9 MR
> jobs comprising an iteration, 6 jobs take (map on) my training data,
> which consists of around 50,000 Wall Street Journal sentences, i.e. a
> rather small dataset. 2 jobs map on cells of a 200*200 matrix, and the
> last one on 50,000 words.
>
> I'm aware that, generally speaking, Hadoop might not be adapted to this
> sort of job because of the overhead involved. What better options come to
> mind to distribute an iterative computation (with "mainstream", possibly
> open-source, software)?
>
> Here are the optimizations I am thinking of, in my present setting:
> - run some jobs in parallel, by using JobControl. One issue I have here
> is that I don't know how to have the code in between MR jobs executed by
> the JobControl. I don't want to create fake MR jobs because that would
> only create more overhead. Is there a solution you know of?
> - avoid constructing Writables every time I need to collect one (tip 6
> from
> http://www.cloudera.com/blog/2009/12/17/7-tips-for-improving-mapreduce-performance/
> )
> - using the Mahout matrix package, I need to change the cardinality of
> matrices and vectors (e.g. by appending an element/row, or removing one)
> -- I'd like to do this in place instead of creating a new 200*200
> matrix; is there some way to do that?
> - run the garbage collector often, to avoid running low on memory, which
> prevents the Hadoop shuffles from running in memory and makes them spill
> to disk
> - use map-only jobs (i.e. using the default IdentityReducer) where
> possible, and set conf.setNumReduceTasks(0) in these cases (my sentence
> data doesn't need merging/sorting)
> - JVM reuse is not available in Hadoop 0.18.3, which is what Amazon
> Elastic MapReduce runs
> - write my own combiner to speed up the shuffle phase (not sure what I
> can achieve)
>
> In Mahout, the Dirichlet and LDA packages follow this iterative pattern,
> albeit with only one MR job per iteration, not 9. Can you give me some
> advice from the experience you have with such iterative MR jobs? Any
> places in the Mahout code I should read? Any optimizations I'm not
> thinking of, or tuning I should consider/check for? If someone would
> like to look at my code, I would be happy to share it.
>
> Thanks!
> Sebastien
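
On the JobControl question: JobControl only sequences jobs; it has no hook
for running arbitrary code between them. One way out is to run it in a
background thread and keep the glue code in the driver JVM, either
alongside the polling loop or between separate JobControl groups. A
minimal sketch against the 0.18-era org.apache.hadoop.mapred.jobcontrol
API (class names and configuration details are placeholders, error
handling elided):

    import java.util.ArrayList;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class IterationDriver {
      public static void main(String[] args) throws Exception {
        JobConf confA = new JobConf(IterationDriver.class); // set I/O paths, mapper, etc.
        JobConf confB = new JobConf(IterationDriver.class);

        Job jobA = new Job(confA, new ArrayList());
        ArrayList deps = new ArrayList();
        deps.add(jobA);
        Job jobB = new Job(confB, deps);   // jobB starts only after jobA succeeds

        JobControl jc = new JobControl("one-iteration");
        jc.addJob(jobA);
        jc.addJob(jobB);

        new Thread(jc).start();            // JobControl implements Runnable
        while (!jc.allFinished()) {
          // glue code between jobs can run here, in the driver JVM
          Thread.sleep(1000);
        }
        jc.stop();
      }
    }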
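
Tip 6 from the Cloudera list is cheap to apply. The pattern under the old
API looks like the sketch below (the tokenizing logic is just an
illustration, not your actual job); it is safe because
OutputCollector.collect() serializes the key and value immediately, so the
same objects can be mutated for the next record:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // allocated once per mapper, reused for every record
      private final Text word = new Text();
      private final IntWritable one = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
          word.set(token);            // mutate in place, no new Text per token
          output.collect(word, one);
        }
      }
    }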
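
On resizing matrices: the Mahout Matrix interface has no in-place
cardinality change. The usual workaround is to over-allocate a backing
matrix once and address it through views, so "resizing" becomes taking a
different view rather than copying 200*200 cells. A hedged sketch,
assuming the org.apache.mahout.matrix package of that era with
DenseMatrix(int, int) and viewPart(int[] offset, int[] size); check both
signatures against the Mahout version you are actually running:

    import org.apache.mahout.matrix.DenseMatrix;
    import org.apache.mahout.matrix.Matrix;

    public class MatrixViews {
      public static void main(String[] args) {
        Matrix backing = new DenseMatrix(400, 400);  // allocate the max you expect
        // a 200*200 working matrix is a view over the backing storage:
        Matrix current = backing.viewPart(new int[] {0, 0}, new int[] {200, 200});
        // appending a row is then just a slightly larger view, no copy:
        Matrix grown = backing.viewPart(new int[] {0, 0}, new int[] {201, 200});
      }
    }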
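
On map-only jobs: note that the default IdentityReducer still pays for the
full sort and shuffle; setting the reduce count to zero skips that
machinery entirely and writes map output straight to the output path,
which is what you want when nothing needs merging or sorting:

    JobConf conf = new JobConf(IterationDriver.class); // placeholder driver class
    conf.setNumReduceTasks(0);  // zero reduces: no sort, no shuffle,
                                // map output goes directly to HDFS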
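
On combiners: a combiner only pays off when map output contains many
repeated keys, and Hadoop may apply it zero or more times, so it must be
associative and commutative. The canonical shape is a sum-of-counts
combiner like the sketch below (old API; the key/value types are
illustrative). Measure before and after: with nearly unique keys a
combiner only adds serialization cost.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable sum = new IntWritable(); // reused, as in tip 6

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int total = 0;
        while (values.hasNext()) {
          total += values.next().get();
        }
        sum.set(total);
        output.collect(key, sum);
      }
    }

    // registered in the driver with:
    // conf.setCombinerClass(SumCombiner.class);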
--
Ted Dunning, CTO DeepDyve
