Hi all,

I've recently started playing with Hadoop, and my first learning experiment has surprised me.

I'm implementing a text problem: extracting infrequent phrases using a mixture of probabilistic models. It's a bit of a toy problem, so the algorithm details aren't super important, though the implementation might be. The problem ended up being represented by a dozen or so map/reduce jobs of various types, including some aggregate steps and some manual joins (see the bottom of http://matpalm.com/sip/take3_markov_chains.html for details). I've implemented each step using Ruby / streaming; the code is at http://github.com/matpalm/sip if anyone cares, and there's a sketch of the shape these steps take in the P.S. below.

I ran some tests using 10 EC2 medium-CPU instances across a small 100 MB of gzipped text. To validate the results I also reimplemented the entire algorithm as a single-threaded Ruby app.

My surprise comes from finding that the single-threaded Ruby implementation outperforms the 10 EC2 instances at this data size. I ran a few samples at different sizes; there's a graph at the bottom of http://matpalm.com/sip/part4_but_does_it_scale.html.

So why is this? Here are my explanations, in order of how confident I am:

a) 100 MB is peanuts, and Hadoop was made for 1000x this size, so the test is invalid.
b) There is a better representation of this problem that uses fewer map/reduce passes.
c) Streaming is too slow, and rewriting in Java (making use of techniques like chaining mappers) would speed things up.
d) Doing these steps, particularly the joins, in Pig would be faster.

My next step is to rewrite some of the steps in Pig to sample the difference.

Does anyone have any high-level comments on this?

Cheers,
mat
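P.S. For anyone who hasn't used streaming, here's a minimal sketch of the shape each step takes. This is a simple token count for illustration, not the actual code from the repo: the mapper emits tab-separated key/value pairs, and the reducer relies on Hadoop having sorted its input by key.

#!/usr/bin/env ruby
# mapper.rb: emit each token with a count of 1
STDIN.each_line do |line|
  line.split.each { |token| puts "#{token}\t1" }
end

#!/usr/bin/env ruby
# reducer.rb: input arrives sorted by key, so sum each run of identical tokens
current, count = nil, 0
STDIN.each_line do |line|
  token, n = line.chomp.split("\t")
  if token != current
    puts "#{current}\t#{count}" if current
    current, count = token, 0
  end
  count += n.to_i
end
puts "#{current}\t#{count}" if current

Each step gets launched as its own job with something like hadoop jar hadoop-streaming.jar -mapper mapper.rb -reducer reducer.rb ..., which is partly why a dozen of them add up: every one pays the full job startup cost.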

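P.P.S. And here's roughly what one of the manual joins boils down to: a reduce-side join where the mappers tag each record with its source (the "A"/"B" tags here are a made-up scheme for illustration), so the reducer sees all records for a key together and can emit the cross product. It's this kind of step I'm hoping Pig handles better.

#!/usr/bin/env ruby
# reduce-side join: mappers have emitted "key\tA\tvalue" or "key\tB\tvalue",
# and hadoop's sort/shuffle delivers all records for a key consecutively
def emit_pairs(key, a_vals, b_vals)
  a_vals.each { |a| b_vals.each { |b| puts "#{key}\t#{a}\t#{b}" } }
end

current, a_vals, b_vals = nil, [], []
STDIN.each_line do |line|
  key, tag, value = line.chomp.split("\t", 3)
  if key != current
    emit_pairs(current, a_vals, b_vals) if current
    current, a_vals, b_vals = key, [], []
  end
  tag == "A" ? a_vals << value : b_vals << value
end
emit_pairs(current, a_vals, b_vals) if current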