Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Dhruv Kumar Thu, 24 Mar 2011 16:42:04 -0700

Thanks Ted, I'll start working on a proposal with the following subtasks
(I have given a rough percentage time estimate for each; please feel free
to suggest alterations):


1. Implementing BW on Map Reduce along the lines of k-means. Focus on
re-using as much of the existing k-means code as possible. (60%)

2. Unit testing the Mapper, Combiner and Reducer, and testing the
integration in local and pseudo-distributed modes. I may be able to get
access to a small cluster at UMass for testing in fully distributed mode.
(35%)

3. Writing clear documentation showing clients how to use the implemented
library code for their needs. (5%)
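
To make sub-tasks 1 and 2 concrete, here is a minimal sketch of one BW
iteration split the same way as k-means: the mapper turns each observation
sequence into expected counts, the combiner/reducer sums them per key, and a
final step normalises the sums into the next model. It assumes a
discrete-observation HMM held as plain lists (pi, A, B). All function names
are hypothetical and not Mahout or Hadoop API, and it is written in Python
for brevity rather than in Mahout's Java.

```python
from collections import defaultdict


def forward_backward(obs, pi, A, B):
    """Unscaled forward/backward pass for one sequence of symbol indices.

    No scaling is applied, so long sequences will underflow; a real
    implementation would rescale or work in log space.
    """
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])


def bw_map(obs, model):
    """Mapper: emit (key, expected count) pairs for one sequence."""
    pi, A, B = model
    n, T = len(pi), len(obs)
    alpha, beta, like = forward_backward(obs, pi, A, B)
    for t in range(T):
        for i in range(n):
            gamma = alpha[t][i] * beta[t][i] / like
            if t == 0:
                yield ("init", i), gamma
            yield ("emit", i, obs[t]), gamma
            if t < T - 1:
                for j in range(n):
                    xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
                    yield ("trans", i, j), xi


def bw_reduce(pairs):
    """Combiner/reducer: sum the expected counts per key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def normalise(weights):
    """Scale a row to sum to 1; fall back to uniform if the row is all zero."""
    s = sum(weights)
    if s <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / s for w in weights]


def bw_update(totals, n, m):
    """Turn summed counts into a new (pi, A, B) for n states and m symbols."""
    pi = normalise([totals.get(("init", i), 0.0) for i in range(n)])
    A = [normalise([totals.get(("trans", i, j), 0.0) for j in range(n)]) for i in range(n)]
    B = [normalise([totals.get(("emit", i, k), 0.0) for k in range(m)]) for i in range(n)]
    return pi, A, B


def bw_iteration(sequences, model, n, m):
    """One full iteration: map every sequence, reduce, then update the model."""
    pairs = (pair for obs in sequences for pair in bw_map(obs, model))
    return bw_update(bw_reduce(pairs), n, m)
```

Because summing counts is associative, the same function can serve as both
combiner and reducer, and each piece can be unit tested on its own: for
example, the "init" counts emitted for one sequence should sum to 1 and its
"emit" counts to the sequence length.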



On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <[email protected]> wrote:

> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <[email protected]> wrote:
>
> > 2. Another very interesting possibility is to express the BW as a
> > recursive join.  There's a very interesting offshoot of Hadoop, called
> > HaLoop (http://code.google.com/p/haloop/), which supports loop control
> > and caching of the intermediate results on the mapper inputs, reducer
> > inputs and reducer outputs to improve performance. The paper [1]
> > describes this in more detail. They have implemented k-means as a
> > recursive join.
> >
>
> Until there is flexibility around the execution model, such as the recent
> map-reduce 2.0 announcement from Yahoo, and until that flexibility is
> pretty much standard, it is hard to justify this.
>
> The exception is where such extended capabilities fit into standard hadoop
> 0.20 environments.
>

>
> > In either case, I want to clearly define the scope and task list. BW
> > will be the core of the project but:
> >
> > 1. Does it make sense to implement the "counting method" for model
> > discovery as well? It is clearly inferior, but would it be a good
> > reference for comparison with the BW? Any added benefit?
> >
>
> No opinion here except that increased scope decreases probability of even
> partial success.
>
>
> > 2. What has been the standard in the past GSoC Mahout projects regarding
> > unit testing and documentation?
> >
>
> Do it.
>
> Seriously.
>
> We use junit 4+ and very much prefer strong unit tests.  Nothing in what
> you are proposing should require anything interesting in this regard.
> Testing the mapper, combiner and reducer in isolation is good.  Testing
> the integrated program in local mode or pseudo distributed mode should
> suffice beyond that.  It is best if you can separate command line argument
> parsing from the execution path so that you can test them separately.
>
> >
> > In the meantime, I've been learning more about Mahout, Map Reduce and
> > Hadoop's internals. One of my course projects this semester is to
> > implement the Bellman Iteration algorithm on Map Reduce and so far it
> > has been coming along well.
> >
> > Any feedback is much appreciated.
> >
> > Dhruv
> >
>
