Hi all,

I've recently started playing with Hadoop, and my first learning experiment has surprised me.

I'm implementing a text problem: extracting infrequent phrases using a mixture of probabilistic models. It's a bit of a toy problem, so the algorithm details aren't super important, though the implementation might be. The problem ended up being represented by a dozen or so map/reduce jobs of various types, including some aggregate steps and some manual joins (see the bottom of http://matpalm.com/sip/take3_markov_chains.html for details). I've implemented each step using Ruby / streaming; the code is at http://github.com/matpalm/sip if anyone cares, and there's a sketch of the shape these steps take in the P.S. below.

I ran some tests using 10 EC2 medium-CPU instances across a small 100 MB of gzipped text. To validate the results I also reimplemented the entire algorithm as a single-threaded Ruby app.

My surprise comes from finding that the single-threaded Ruby implementation outperforms the 10 EC2 instances at this data size. I ran a few samples at different sizes; there's a graph at the bottom of http://matpalm.com/sip/part4_but_does_it_scale.html.

So why is this? Here are my explanations, in order of how confident I am:

a) 100 MB is peanuts, and Hadoop was made for 1000x this size, so the test is invalid.
b) There is a better representation of this problem that uses fewer map/reduce passes.
c) Streaming is too slow, and rewriting in Java (making use of techniques like chaining mappers) would speed things up.
d) Doing these steps, particularly the joins, in Pig would be faster.

My next step is to rewrite some of the steps in Pig to sample the difference.

Does anyone have any high-level comments on this?

Cheers,
mat
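P.S. For anyone who hasn't used streaming, here's a minimal sketch of the shape each step takes. This is a simple token count for illustration, not the actual code from the repo: the mapper emits tab-separated key/value pairs, and the reducer relies on Hadoop having sorted its input by key.

#!/usr/bin/env ruby
# mapper.rb: emit each token with a count of 1
STDIN.each_line do |line|
  line.split.each { |token| puts "#{token}\t1" }
end

#!/usr/bin/env ruby
# reducer.rb: input arrives sorted by key, so sum each run of identical tokens
current, count = nil, 0
STDIN.each_line do |line|
  token, n = line.chomp.split("\t")
  if token != current
    puts "#{current}\t#{count}" if current
    current, count = token, 0
  end
  count += n.to_i
end
puts "#{current}\t#{count}" if current

Each step gets launched as its own job with something like hadoop jar hadoop-streaming.jar -mapper mapper.rb -reducer reducer.rb ..., which is partly why a dozen of them add up: every one pays the full job startup cost.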

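P.P.S. And here's roughly what one of the manual joins boils down to: a reduce-side join where the mappers tag each record with its source (the "A"/"B" tags here are a made-up scheme for illustration), so the reducer sees all records for a key together and can emit the cross product. It's this kind of step I'm hoping Pig handles better.

#!/usr/bin/env ruby
# reduce-side join: mappers have emitted "key\tA\tvalue" or "key\tB\tvalue",
# and hadoop's sort/shuffle delivers all records for a key consecutively
def emit_pairs(key, a_vals, b_vals)
  a_vals.each { |a| b_vals.each { |b| puts "#{key}\t#{a}\t#{b}" } }
end

current, a_vals, b_vals = nil, [], []
STDIN.each_line do |line|
  key, tag, value = line.chomp.split("\t", 3)
  if key != current
    emit_pairs(current, a_vals, b_vals) if current
    current, a_vals, b_vals = key, [], []
  end
  tag == "A" ? a_vals << value : b_vals << value
end
emit_pairs(current, a_vals, b_vals) if current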