On Wed, Apr 6, 2011 at 9:42 AM, Grant Ingersoll <[email protected]> wrote:
> I think this looks really good.
>

Thanks! I'll be polishing it more today on MAHOUT-627, documenting any changes in the comments section, and I will correspondingly update the proposal on the GSoC site.

Just a few quick questions:

1. Is it better to start by writing the driver first, followed by the mapper, combiner and reducer, or to write the driver at the end? Is it just a matter of personal style (top-down vs. bottom-up development), or do these approaches have tangible pros and cons in the context of iterative MapReduce programming?

2. Is there a preferred IDE (Eclipse, IntelliJ or ...)? I've been using Eclipse so far.

3. Is there a place where I can find guidelines for code formatting specific to Mahout? Things like indentation, class names, comments, etc.

Dhruv

>
> On Apr 4, 2011, at 6:34 PM, Dhruv Kumar wrote:
>
> > Here's a first draft of my proposal. It would be great if I could get any feedback before I submit it to GSoC and update MAHOUT-627.
> >
> > Thank you.
> >
> > Dhruv
> >
> > --------- Proposal Begin ---------
> >
> > Proposal Title: Implementation of the Baum-Welch Algorithm for parallel Hidden Markov Model training on MapReduce.
> >
> > Student Name: Dhruv Kumar
> >
> > Student E-mail: [email protected]
> >
> > Organization/Project: Apache Mahout
> >
> > Assigned Mentor:
> >
> > Proposal Abstract:
> >
> > The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot scale to train over large data sets. This project proposes to extend the sequential implementation of Baum-Welch to a parallel, distributed version using the MapReduce programming framework, allowing scalable Hidden Markov Model training.
> >
> > Detailed Description:
> >
> > Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.
> >
> > An HMM is defined as a tuple (S, O, Theta), where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic transition matrix of the hidden states of size |S| x |S|. The elements a_(i,j) of A specify the probability of transitioning from state i to state j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose elements b_(s,o) give the probability that symbol o will be emitted from hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.
> >
> > Rabiner [1] defined three main problems for HMMs:
> >
> > 1. Evaluation: Given the complete model (S, O, Theta) and an observation sequence, determine the probability that the model generated the observed sequence.
> > This is useful for determining the quality of the model and is solved using the so-called Forward algorithm.
> >
> > 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and the observed sequence are used to predict the values of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.
> >
> > 3. Training: Given the set of hidden states S, the observation vocabulary O and an observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of fitting a model to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
> >
> > Since most problems modeled by HMMs have a modest number of hidden and observable states, the sequential versions of the Forward and the Viterbi algorithms (currently implemented in Mahout) are sufficient for evaluation and decoding. However, the training data is often so large that a single compute node cannot handle it. Attempts to prune the training data can lead to lower-quality model fitting and may miss crucial domain knowledge because of inferior posterior probabilities. Currently, Mahout only supports the sequential HMM trainers (Viterbi and Baum-Welch), which are incapable of scaling to large data sets.
> >
> > This project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. The distributed version of BW will use the sequential implementations of the Forward and the Backward algorithms in each iteration, so a lot of existing HMM training code will be reused.
> >
> > Of the two sequential HMM training methods, the Baum-Welch (BW) or Forward-Backward algorithm is the better candidate for a parallel implementation, for two reasons. (1) BW is more numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator (MLE), albeit a local maximum, unlike Viterbi training where the MLE is approximated in order to reduce computation time. (2) BW belongs to the general class of Expectation Maximization (EM) algorithms, which naturally fit the MapReduce framework [2]. Specifically, the parallel implementation of the BW algorithm on MapReduce has been elaborated at great length in [3].
> >
> > The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration: Expectation (E) and Maximization (M). In the distributed case, the E step is computed by the mappers and the reducers, while the M step is handled by the reducers. Starting from an initial Theta(0), in each iteration i the model parameter tuple Theta(i) is input to the algorithm, and the resulting Theta(i+1) is fed to the next iteration i+1. The iterations stop on a user-specified convergence condition expressed as a fixed point, or when the number of iterations exceeds a user-defined value.
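> >
> > Before turning to the MapReduce formulation of the E and M steps, the following small, self-contained sketch makes the model tuple (A, B, Pi) and the forward pass used during the E step concrete. It is illustrative only: it does not use Mahout's existing HmmModel or HmmAlgorithms classes, all names are provisional, and it omits the scaling (or log-space arithmetic) that a real implementation needs for numerical stability.
> >
> > // Illustrative sketch only; provisional names, unrelated to Mahout's existing HMM classes.
> > public class HmmSketch {
> >   private final double[][] a;  // |S| x |S| transition matrix, a[i][j] = P(state j | state i)
> >   private final double[][] b;  // |S| x |O| emission matrix, b[s][o] = P(symbol o | state s)
> >   private final double[] pi;   // length-|S| initial state distribution
> >
> >   public HmmSketch(double[][] a, double[][] b, double[] pi) {
> >     this.a = a;
> >     this.b = b;
> >     this.pi = pi;
> >   }
> >
> >   // Forward recursion: alpha[t][s] = P(o_1 ... o_t, state at time t = s | Theta).
> >   public double[][] forward(int[] obs) {
> >     int numStates = pi.length;
> >     double[][] alpha = new double[obs.length][numStates];
> >     for (int s = 0; s < numStates; s++) {
> >       alpha[0][s] = pi[s] * b[s][obs[0]];
> >     }
> >     for (int t = 1; t < obs.length; t++) {
> >       for (int j = 0; j < numStates; j++) {
> >         double sum = 0.0;
> >         for (int i = 0; i < numStates; i++) {
> >           sum += alpha[t - 1][i] * a[i][j];
> >         }
> >         alpha[t][j] = sum * b[j][obs[t]];
> >       }
> >     }
> >     return alpha;
> >   }
> >
> >   // Evaluation (problem 1): likelihood of the observation sequence under Theta.
> >   public double likelihood(int[] obs) {
> >     double[][] alpha = forward(obs);
> >     double p = 0.0;
> >     for (double v : alpha[obs.length - 1]) {
> >       p += v;
> >     }
> >     return p;
> >   }
> > }
> >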
> > The E step computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish the E step by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols of length n. For each training instance, the mappers emit the same set of keys, corresponding to the following three optimization problems to be solved during the M step: (1) the expected number of times a hidden state is reached, (2) the expected number of times each observable symbol is generated by each hidden state, and (3) the expected number of transitions between each pair of states in the hidden state space.
> >
> > The M step computes the updated Theta(i+1) from the values generated during the E step. This involves aggregating, for each key, the values obtained in the E step for the corresponding optimization problem. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.
> >
> > The project can be subdivided into the distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner. A list of milestones with their associated deliverables is given below.
> >
> > Timeline: April 26 - Aug 15.
> >
> > Milestones:
> >
> > April 26 - May 22 (4 weeks): Pre-coding tasks. Open communication with my mentor, refine the project's plan and requirements, understand the code styling requirements, and deepen my knowledge of Hadoop's and Mahout's internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.
> >
> > May 23 - June 3 (2 weeks): Driver. Implement and test the class HmmDriver by extending the AbstractJob class and reusing code from the KMeansDriver class.
> >
> > June 3 - July 1 (4 weeks): Mapper. Implement and test the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() to partially compute the E step.
> >
> > July 1 - July 15 (2 weeks): Reducer. Implement and test the class HmmReducer. The reducer will complete the E step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse code from the HmmTrainer.trainBaumWelch() method if possible.
> >
> > July 15 - July 29 (2 weeks): Combiner. Implement and test the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to the optimization problems for the M stage. Look at the possibility of code reuse from KMeansCombiner. A rough sketch of how these three classes might fit together is given below.
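> >
> > The skeletons below are only an illustration of the intended structure (provisional class names and an assumed key layout, not existing Mahout code): one MapReduce job corresponds to one EM iteration, the mappers compute partial expected counts per training sequence, the combiner sums them locally, and the reducer finishes the E step before performing the M step.
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.io.DoubleWritable;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Mapper;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class BaumWelchSketch {
> >
> >   // E step (partial): one observation sequence per call to map().
> >   public static class HmmMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
> >     // setup() would deserialize Theta(i), i.e. the model produced by the previous
> >     // iteration, from a location placed in the Configuration by the driver.
> >
> >     @Override
> >     protected void map(LongWritable key, Text value, Context context)
> >         throws IOException, InterruptedException {
> >       // Assume one training sequence per line: whitespace-delimited integer symbol ids.
> >       String[] tokens = value.toString().trim().split("\\s+");
> >       int[] obs = new int[tokens.length];
> >       for (int i = 0; i < tokens.length; i++) {
> >         obs[i] = Integer.parseInt(tokens[i]);
> >       }
> >       // Run the forward and backward passes over obs (reusing the existing sequential
> >       // HmmAlgorithms code) to obtain the posteriors, then emit expected counts keyed
> >       // by the three M-step problems, for example:
> >       //   "initial,<state>"               expected times the chain starts in <state>
> >       //   "emit,<state>,<symbol>"         expected emissions of <symbol> from <state>
> >       //   "transit,<state_i>,<state_j>"   expected transitions from state_i to state_j
> >       context.write(new Text("initial,0"), new DoubleWritable(0.0)); // placeholder value
> >     }
> >   }
> >
> >   // Combiner: purely local aggregation of partial counts to cut shuffle traffic.
> >   public static class HmmCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
> >     @Override
> >     protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
> >         throws IOException, InterruptedException {
> >       double sum = 0.0;
> >       for (DoubleWritable v : values) {
> >         sum += v.get();
> >       }
> >       context.write(key, new DoubleWritable(sum));
> >     }
> >   }
> >
> >   // Reducer: finishes the E step by summing the counts for a key. The M step
> >   // (normalizing each group of counts into relative frequencies and writing
> >   // Theta(i+1) to HDFS) is omitted here; a real key design must group all counts
> >   // of one distribution, e.g. one row of A, into a single reduce call so that
> >   // the row can be normalized.
> >   public static class HmmReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
> >     @Override
> >     protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
> >         throws IOException, InterruptedException {
> >       double sum = 0.0;
> >       for (DoubleWritable v : values) {
> >         sum += v.get();
> >       }
> >       context.write(key, new DoubleWritable(sum));
> >     }
> >   }
> > }
> >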
> > July 29 - August 15 (2 weeks): Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW. Tidy up the code and fix loose ends, conduct final tests, and finish any remaining documentation.
> >
> > Additional Information:
> >
> > I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. As part of my Master's research work, under the guidance of Prof. Wayne Burleson, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved in two course projects on machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side-chain geometry on a protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system by using MDP theory and recasting the problem as an instance of sequential optimal control.
> >
> > I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, open source performance analyzers, scripting languages such as Python, and the suite of GNU tools. I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will allow me to contribute within my modest means to the overall spirit of open coding.
> >
> > References:
> >
> > [1] A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
> >
> > [2] Map-Reduce for Machine Learning on Multicore, by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng and Kunle Olukotun. In NIPS (2006), pp. 281-288.
> >
> > [3] Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer. Morgan & Claypool, 2010.
> >
> > --------- Proposal End -------------
> >
> > On Thu, Mar 24, 2011 at 8:08 PM, Ted Dunning <[email protected]> wrote:
> >
> >> I would invert the order of 1 and 2 and estimate the times as 30, 30, 40. You can view this from a few points of view:
> >>
> >> a) writing good tests is hard
> >>
> >> b) writing good tests makes writing code much easier
> >>
> >> c) writing any kind of decent documentation is VASTLY harder than most people think. It is also very important for user adoption, mentor satisfaction and personal marketing (for instance a resume or portfolio)
> >>
> >> On Thu, Mar 24, 2011 at 4:41 PM, Dhruv Kumar <[email protected]> wrote:
> >>
> >>> Thanks Ted, I'll start working on a proposal with the following sub-tasks (I have given rudimentary percentage time estimates, please feel free to suggest alterations):
> >>>
> >>> 1. Implementing BW on MapReduce along the lines of k-means. Focus on re-using as much of the existing k-means code as possible. (60%)
> >>>
> >>> 2. Unit testing the Mapper, Combiner and Reducer, and testing the integration in local and pseudo-distributed modes.
> >>> I may be able to get access to a small cluster at UMass for testing in fully distributed mode. (35%)
> >>>
> >>> 3. Writing clear documentation directing clients how to use the implemented library code for their needs. (5%)
> >>>
> >>> On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <[email protected]> wrote:
> >>>>
> >>>>> 2. Another very interesting possibility is to express the BW as a recursive join. There's a very interesting offshoot of Hadoop, called HaLoop (http://code.google.com/p/haloop/), which supports loop control and caching of the intermediate results on the mapper inputs, reducer inputs and reducer outputs to improve performance. The paper [1] describes this in more detail. They have implemented k-means as a recursive join.
> >>>>
> >>>> Until there is flexibility around the execution model, such as the recent MapReduce 2.0 announcement from Yahoo, and until that flexibility is pretty much standard, it is hard to justify this. The exception is where such extended capabilities fit into standard Hadoop 0.20 environments.
> >>>>
> >>>>> In either case, I want to clearly define the scope and task list. BW will be the core of the project, but:
> >>>>>
> >>>>> 1. Does it make sense to implement the "counting method" for model discovery as well? It is clearly inferior, but would it be a good reference for comparison with BW? Any added benefit?
> >>>>
> >>>> No opinion here except that increased scope decreases the probability of even partial success.
> >>>>
> >>>>> 2. What has been the standard in past GSoC Mahout projects regarding unit testing and documentation?
> >>>>
> >>>> Do it.
> >>>>
> >>>> Seriously.
> >>>>
> >>>> We use JUnit 4+ and very much prefer strong unit tests. Nothing in what you are proposing should require anything interesting in this regard. Testing the mapper, combiner and reducer in isolation is good. Testing the integrated program in local mode or pseudo-distributed mode should suffice beyond that. It is best if you can separate command line argument parsing from the execution path so that you can test them separately.
> >>>>
> >>>>> In the meantime, I've been understanding more about Mahout, MapReduce and Hadoop's internals. One of my course projects this semester is to implement the Bellman Iteration algorithm on MapReduce, and so far it has been coming along well.
> >>>>>
> >>>>> Any feedback is much appreciated.
> >>>>>
> >>>>> Dhruv
>
> --------------------------
> Grant Ingersoll
> Lucene Revolution -- Lucene and Solr User Conference
> May 25-26 in San Francisco
> www.lucenerevolution.org
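
P.S. On Ted's earlier point about unit testing and about keeping the core computation separate from the Hadoop plumbing: the snippet below is only meant to show the flavor of JUnit 4 test I have in mind. The normalizeRow() helper is a hypothetical stand-in for the M-step normalization, not existing Mahout code.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class MStepNormalizationTest {

  // Hypothetical stand-in for the M-step helper: turns a row of expected
  // counts into relative frequencies, i.e. one stochastic row of A or B.
  static double[] normalizeRow(double[] expectedCounts) {
    double total = 0.0;
    for (double c : expectedCounts) {
      total += c;
    }
    double[] row = new double[expectedCounts.length];
    for (int i = 0; i < expectedCounts.length; i++) {
      row[i] = expectedCounts[i] / total;
    }
    return row;
  }

  @Test
  public void rowBecomesAProbabilityDistribution() {
    double[] row = normalizeRow(new double[] {2.0, 6.0, 2.0});
    assertEquals(0.2, row[0], 1e-9);
    assertEquals(0.6, row[1], 1e-9);
    assertEquals(0.2, row[2], 1e-9);
  }
}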
