Thanks Ted, I'll start working on a proposal with the following subtasks (I have given a rough estimate of the percentage of time for each; please feel free to suggest alterations):

1. Implementing the BW on Map Reduce along the lines of k-means. Focus on
re-using as much of the existing k-means code as possible. (60%)

2. Unit testing the Mapper, Combiner and Reducer, and testing the integration
in local and pseudo-distributed modes. I may be able to get access to a small
cluster at UMass for testing in fully distributed mode. (35%)

3. Writing clear documentation showing clients how to use the implemented
library code for their needs. (5%)

On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <[email protected]> wrote:

> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <[email protected]> wrote:
>
> > 2. Another very interesting possibility is to express the BW as a
> > recursive join. There's a very interesting offshoot of Hadoop, called
> > Haloop (http://code.google.com/p/haloop/) which supports loop control,
> > and caching of the intermediate results on the mapper inputs, reducer
> > inputs and reducer outputs to improve performance. The paper [1]
> > describes this in more detail. They have implemented k-means as a
> > recursive join.
> >
>
> Until there is flexibility around execution model such as the recent
> map-reduce 2.0 announcement from Yahoo and until that flexibility is
> pretty much standard, it is hard to justify this.
>
> The exception is where such extended capabilities fit into standard hadoop
> 0.20 environments.
>
> > In either case, I want to clearly define the scope and task list. BW
> > will be the core of the project but:
> >
> > 1. Does it make sense for implementing the "counting method" for model
> > discovery as well? It is clearly inferior but will it be a good
> > reference for comparison to the BW. Any added benefit?
> >
>
> No opinion here except that increased scope decreases probability of even
> partial success.
>
> > 2. What has been the standard in the past GSoC Mahout projects regarding
> > unit testing and documentation?
> >
>
> Do it.
>
> Seriously.
>
> We use junit 4+ and very much prefer strong unit tests. Nothing in what
> you are proposing should require anything interesting in this regard.
> Testing the mapper, combiner and reducer in isolation is good. Testing
> the integrated program in local mode or pseudo distributed mode should
> suffice beyond that. It is best if you can separate command line argument
> parsing from execution path so that you can test them separately.
>
> > In the meantime, I've been understanding more about Mahout, Map Reduce
> > and Hadoop's internals. One of my course projects this semester is to
> > implement the Bellman Iteration algorithm on Map Reduce and so far it
> > has been coming along well.
> >
> > Any feedback is much appreciated.
> >
> > Dhruv
> >
>
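
To make sure I understand the point about keeping command line argument parsing apart from the execution path, here is a rough sketch of the shape I have in mind. All names in it (TrainerOptions, TrainerDriver, the --input/--output/--states flags) are placeholders I made up for illustration and are not existing Mahout classes or options. The run method only returns a description of what it would do, standing in for the code that would configure and launch the actual job.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for parsed options; not part of Mahout. */
final class TrainerOptions {
    final String input;
    final String output;
    final int hiddenStates;

    TrainerOptions(String input, String output, int hiddenStates) {
        this.input = input;
        this.output = output;
        this.hiddenStates = hiddenStates;
    }

    /** Parses "--flag value" pairs; touches no Hadoop classes, so it is easy to unit test. */
    static TrainerOptions parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected --flag value pairs");
        }
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("not a flag: " + args[i]);
            }
            flags.put(args[i].substring(2), args[i + 1]);
        }
        int states = Integer.parseInt(require(flags, "states"));
        if (states < 1) {
            throw new IllegalArgumentException("--states must be at least 1");
        }
        return new TrainerOptions(require(flags, "input"), require(flags, "output"), states);
    }

    private static String require(Map<String, String> flags, String name) {
        String value = flags.get(name);
        if (value == null) {
            throw new IllegalArgumentException("missing --" + name);
        }
        return value;
    }
}

/** Hypothetical driver: the execution path only ever sees already-parsed options. */
final class TrainerDriver {
    /** Placeholder for job setup and launch; returns a summary instead of running anything. */
    static String run(TrainerOptions options) {
        return "train " + options.hiddenStates + " states: "
                + options.input + " -> " + options.output;
    }

    public static void main(String[] args) {
        System.out.println(run(TrainerOptions.parse(args)));
    }
}
```

With this split, the parsing tests only need string arrays, and the execution path can be tested by constructing a TrainerOptions directly without going through the command line at all.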
