Hi all,

I'm working on a machine learning project with Hadoop, in which I use the
Mahout matrix package.

My application is effectively an instance of Gibbs sampling, and therefore
iterative. One iteration consists of 9 MR jobs, with some dependencies
between them and some code executed in between. I need to iterate thousands
of times. At present an iteration takes around 7 minutes (I run everything
on Amazon Elastic MapReduce), which makes thousands of iterations
prohibitive. The bulk of the time is eaten up by Hadoop overhead, not by
actual number-crunching. Of the 9 MR jobs comprising an iteration, 6 map
over my training data, which consists of around 50,000 Wall Street Journal
sentences, i.e. a rather small dataset; 2 map over the cells of a 200x200
matrix, and the last one over 50,000 words.

I'm aware that, generally speaking, Hadoop may not be well suited to this
sort of job because of the overhead involved. What better options come to
mind for distributing an iterative computation (with "mainstream", possibly
open-source, software)?

Here are the optimizations I am thinking of in my present setting:
- run some jobs in parallel using JobControl. One issue I have here is that
I don't know how to get the code in between MR jobs executed by the
JobControl-ler, and I don't want to create fake MR jobs because that would
only add overhead. Is there a solution you know of? (Sketch A below the
list is the workaround I have in mind.)
- avoid constructing Writables every time I need to collect one (tip 6 from
http://www.cloudera.com/blog/2009/12/17/7-tips-for-improving-mapreduce-performance/
); sketch B below shows what I mean
- using the Mahout matrix package, I need to change the cardinality of
matrices and vectors (e.g. by appending an element/row, or removing one).
I'd like to do this in place instead of creating a new 200x200 matrix each
time -- is there some way to do that? (Sketch C below is the view-based
idea I'm considering.)
- run the garbage collector often to avoid running low on memory; when
memory is tight, the Hadoop shuffle can't run in memory and spills to disk
(sketch D below lists the configuration knobs I'd look at as well)
- use map-only jobs where possible by setting conf.setNumReduceTasks(0);
with zero reduces the shuffle/sort is skipped entirely, whereas keeping the
default IdentityReducer would still pay for a full shuffle. My sentence
data doesn't need merging/sorting (sketch E below)
- JVM reuse is not available in Hadoop 0.18.3, which is what Amazon Elastic
MapReduce runs, so that option is out
- write my own combiner to shrink the shuffle (not sure how much I can
gain; sketch F below shows the wiring I'd try)
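
Sketch A -- what I have in mind for JobControl: run a group of independent
jobs in parallel, wait for the group in the driver thread, and run my
in-between code there. updateHyperParameters() is a placeholder for my own
code, not a real API:

    import java.util.ArrayList;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class IterationStep {
      public static void runStep(JobConf jobConfA, JobConf jobConfB) throws Exception {
        JobControl group = new JobControl("step-1");
        group.addJob(new Job(jobConfA, new ArrayList()));  // A and B are independent
        group.addJob(new Job(jobConfB, new ArrayList()));
        Thread runner = new Thread(group);  // JobControl is a Runnable
        runner.start();
        while (!group.allFinished()) {      // poll until both jobs are done
          Thread.sleep(500);
        }
        group.stop();
        // plain driver code between MR steps runs here -- no fake MR job needed
        updateHyperParameters();
        // ...build and start the next group here...
      }

      private static void updateHyperParameters() { /* placeholder */ }
    }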
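
Sketch B -- my reading of tip 6: create the Writables once per task and
reuse them on every collect() call. TokenCountMapper is a made-up example,
not one of my actual mappers:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // allocated once per task, reused across all map() calls
      private final Text word = new Text();
      private static final IntWritable ONE = new IntWritable(1);

      public void map(LongWritable key, Text sentence,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        for (String tok : sentence.toString().split("\\s+")) {
          word.set(tok);          // reuse instead of "new Text(tok)"
          out.collect(word, ONE);
        }
      }
    }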
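
Sketch C -- the only in-place idea I have for the cardinality problem:
allocate the backing matrix once at an upper bound and grow/shrink via
viewPart(), assuming viewPart() in org.apache.mahout.matrix behaves the way
I think it does (no copying, just a window on the backing storage):

    import org.apache.mahout.matrix.DenseMatrix;
    import org.apache.mahout.matrix.Matrix;

    public class MatrixResize {
      public static void main(String[] args) throws Exception {
        int max = 256;  // assumed upper bound on the cardinality
        Matrix backing = new DenseMatrix(max, max);
        // "shrink" to 200x200 without copying: a view on the top-left corner
        Matrix current = backing.viewPart(new int[] {0, 0}, new int[] {200, 200});
        // "append" a row later by simply widening the view
        Matrix grown = backing.viewPart(new int[] {0, 0}, new int[] {201, 200});
      }
    }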
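
Sketch D -- the configuration knobs I'd tune before (or instead of) calling
System.gc() by hand; the values here are illustrative guesses, not
recommendations:

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
      public static JobConf tunedConf() {
        JobConf conf = new JobConf(ShuffleTuning.class);
        conf.set("io.sort.mb", "200");                   // map-side sort buffer: fewer spills
        conf.set("mapred.child.java.opts", "-Xmx512m");  // heap for the task JVMs
        return conf;
      }
    }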
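
Sketch E -- a map-only job as I understand it; IdentityMapper stands in for
my real sentence mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MapOnlyDriver {
      public static void run(Path in, Path out) throws Exception {
        JobConf conf = new JobConf(MapOnlyDriver.class);
        conf.setJobName("map-only-step");
        conf.setMapperClass(IdentityMapper.class);  // stand-in for my mapper
        conf.setNumReduceTasks(0);  // zero reduces: map output goes straight
                                    // to HDFS, no shuffle/sort at all
        FileInputFormat.setInputPaths(conf, in);
        FileOutputFormat.setOutputPath(conf, out);
        JobClient.runJob(conf);
      }
    }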
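
Sketch F -- the combiner wiring I'd try; it's only valid when the reduce
function is associative and commutative, so I use Hadoop's stock
LongSumReducer as the example here:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class CombinerWiring {
      public static void wire(JobConf conf) {
        // run the reduce logic map-side too, so less data crosses the network
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
      }
    }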

In Mahout, the Dirichlet and LDA packages follow this iterative pattern,
albeit with only one MR job per iteration rather than 9. Can you give me
some advice from your experience with such iterative MR jobs? Any places in
the Mahout code I should read? Any optimizations I'm not thinking of, or
tuning I should consider or check for? If someone would like to look at my
code, I'd be happy to share it.

Thanks!
Sebastien
