Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

Dhruv Kumar Sun, 14 Aug 2011 05:18:09 -0700

Very good actually, just tying up some loose ends and I will be putting up
the patch with all the unit tests and the documentation soon!


I have to just debug the slf4j issue which I have discussed on the mailing
list with Sean and Ted...

I still need the rest of this week to tidy up the documentation a little
bit.

On Sun, Aug 14, 2011 at 7:25 AM, Grant Ingersoll (JIRA) <[email protected]>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084820#comment-13084820]
>
> Grant Ingersoll commented on MAHOUT-627:
> ----------------------------------------
>
> Hey Dhruv, nearing pencils down, how are we doing?
>
> > Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model
> Training.
> >
> -----------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-627
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
> >             Project: Mahout
> >          Issue Type: Task
> >          Components: Classification
> >    Affects Versions: 0.4, 0.5
> >            Reporter: Dhruv Kumar
> >            Assignee: Grant Ingersoll
> >              Labels: gsoc, gsoc2011, mahout-gsoc-11
> >             Fix For: 0.6
> >
> >         Attachments: MAHOUT-627.patch, MAHOUT-627.patch,
> MAHOUT-627.patch, MAHOUT-627.patch
> >
> >
> > Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden
> Markov Model Training.
> > Student Name: Dhruv Kumar
> > Student E-mail: [email protected]
> > Organization/Project: Apache Mahout
> > Assigned Mentor:
> > Proposal Abstract:
> > The Baum-Welch algorithm is commonly used for training a Hidden Markov
> Model because of its superior numerical stability and its ability to
> guarantee the discovery of a locally maximum,  Maximum Likelihood Estimator,
> in the presence of incomplete training data. Currently, Apache Mahout has a
> sequential implementation of the Baum-Welch which cannot be scaled to train
> over large data sets. This restriction reduces the quality of training and
> constrains generalization of the learned model when used for prediction.
> This project proposes to extend Mahout's Baum-Welch to a parallel,
> distributed version using the Map-Reduce programming framework for enhanced
> model fitting over large data sets.
> > Detailed Description:
> > Hidden Markov Models (HMMs) are widely used as a probabilistic inference
> tool for applications generating temporal or spatial sequential data.
> Relative simplicity of implementation, combined with their ability to
> discover latent domain knowledge have made them very popular in diverse
> fields such as DNA sequence alignment, gene discovery, handwriting analysis,
> voice recognition, computer vision, language translation and parts-of-speech
> tagging.
> > A HMM is defined as a tuple (S, O, Theta) where S is a finite set of
> unobservable, hidden states emitting symbols from a finite observable
> vocabulary set O according to a probabilistic model Theta. The parameters of
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic
> transition matrix of the hidden states of size |S| X |S|. The elements
> a_(i,j) of A specify the probability of transitioning from a state i to
> state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix
> whose elements b_(s, o) provide the probability that a symbol o will be
> emitted from the hidden state s. The elements pi_(s) of the |S| length
> vector Pi determine the probability that the system starts in the hidden
> state s. The transitions of hidden states are unobservable and follow the
> Markov property of memorylessness.
> > Rabiner [1] defined three main problems for HMMs:
> > 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the
> observation sequence, determine the probability that the model generated the
> observed sequence. This is useful for evaluating the quality of the model
> and is solved using the so called Forward algorithm.
> > 2. Decoding: Given the complete model (S, O, Theta) and an observation
> sequence, determine the hidden state sequence which generated the observed
> sequence. This can be viewed as an inference problem where the model and
> observed sequence are used to predict the value of the unobservable random
> variables. The backward algorithm, also known as the Viterbi decoding
> algorithm is used for predicting the hidden state sequence.
> > 3. Training: Given the set of hidden states S, the set of observation
> vocabulary O and the observation sequence, determine the parameters (A, B,
> Pi) of the model Theta. This problem can be viewed as a statistical machine
> learning problem of model fitting to a large set of training data. The
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and
> the Viterbi training algorithm are commonly used for model fitting.
> > In general, the quality of HMM training can be improved by employing
> large training vectors but currently, Mahout only supports sequential
> versions of HMM trainers which are incapable of scaling.  Among the Viterbi
> and the Baum-Welch training methods, the Baum-Welch algorithm is superior,
> accurate, and a better candidate for a parallel implementation for two
> reasons:
> > (1) The BW is numerically stable and provides a guaranteed discovery of
> the locally maximum, Maximum Likelihood Estimator (MLE) for model's
> parameters (Theta). In Viterbi training, the MLE is approximated in order to
> reduce computation time.
> > (2) The BW belongs to the general class of Expectation Maximization (EM)
> algorithms which naturally fit into the Map-Reduce framework [2], such as
> the existing Map Reduce implementation of k-means in Mahout.
> > Hence, this project proposes to extend Mahout's current sequential
> implementation of the Baum-Welch HMM trainer to a scalable, distributed
> case. Since the distributed version of the BW will use the sequential
> implementations of the Forward and the Backward algorithms to compute the
> alpha and the beta factors in each iteration, a lot of existing HMM training
> code will be reused. Specifically, the parallel implementation of the BW
> algorithm on Map Reduce has been elaborated at great length in [3] by
> viewing it as a specific case of the Expectation-Maximization algorithm and
> will be followed for implementation in this project.
> > The BW EM algorithm iteratively refines the model's parameters and
> consists of two distinct steps in each iteration--Expectation and
> Maximization. In the distributed case, the Expectation step is computed by
> the mappers and the reducers, while the Maximization is handled by the
> reducers. Starting from an initial Theta^(0), in each iteration i, the model
> parameter tuple Theta^i is input to the algorithm, and the end result
> Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user
> specified convergence condition expressed as a fixpoint or when the number
> of iterations exceeds a user defined value.
> > Expectation computes the posterior probability of each latent variable
> for each observed variable, weighed by the relative frequency of the
> observed variable in the input split. The mappers process independent
> training instances and emit expected state transition and emission counts
> using the Forward and Backward algorithms. The reducers finish Expectation
> by aggregating the expected counts. The input to a mapper consists of (k,
> v_o) pairs where k is a unique key and v_o is a string of observed symbols.
> For each training instance, the mappers emit the same set of keys
> corresponding to the following three optimization problems to be solved
> during the Maximization, and their values in a hash-map:
> > (1) Expected number of times a hidden state is reached (Pi).
> > (2) Number of times each observable symbol is generated by each hidden
> state (B).
> > (3) Number of transitions between each pair of states in the hidden state
> space (A).
> > The M step computes the updated Theta(i+1) from the values generated
> during the E part. This involves aggregating the values (as hash-maps) for
> each key corresponding to one of the optimization problems. The aggregation
> summarizes the statistics necessary to compute a subset of the parameters
> for the next EM iteration. The optimal parameters for the next iteration are
> arrived by computing the relative frequency of each event with respect to
> its expected count at the current iteration. The emitted optimal parameters
> by each reducer are written to the HDFS and are fed to the mappers in the
> next iteration.
> > The project can be subdivided into distinct tasks of programming, testing
> and documenting the driver, mapper, reducer and the combiner with the
> Expectation and Maximization parts split between them. For each of these
> tasks, a new class will be programmed, unit tested and documented within the
> org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is
> also an EM algorithm, particular attention will be paid to its code at each
> step for possible reuse.
> > A list of milestones, associated deliverable and high level
> implementation details is given below.
> > Time-line: April 26 - Aug 15.
> > Milestones:
> > April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my
> mentor, refine the project's plan and requirements, understand the
> community's code styling requirements, expand the knowledge on Hadoop and
> Mahout internals. Thoroughly familiarize with the classes within the
> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and
> math packages.
> > May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document
> the class HmmDriver by extending the AbstractJob class and by reusing the
> code from the KMeansDriver class.
> > June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document
> the class HmmMapper. The HmmMapper class will include setup() and map()
> methods. The setup() method will read in the HmmModel and the parameter
> values obtained from the previous iteration. The map() method will call the
> HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm()
> and complete the Expectation step partially.
> > July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document
> the class HmmReducer. The reducer will complete the Expectation step by
> summing over all the occurences emitted by the mappers for the three
> optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch()
> method if possible. Also, mid-term review.
> > July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and
> document the class HmmCombiner. The combiner will reduce the network traffic
> and improve efficiency by aggregating the values for each of the three keys
> corresponding to each of the optimization problems for the Maximization
> stage in reducers. Look at the possibility of code reuse from the
> KMeansCombiner class.
> > July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer,
> combiner and driver together. Give an example demonstrating the new parallel
> BW algorithm by employing the parts-of-speech tagger data set also used by
> the sequential BW [4]. Tidy up code and fix loose ends, finish wiki
> documentation.
> > Additional Information:
> > I am in the final stages of finishing my Master's degree in Electrical
> and Computer Engineering from the University of Massachusetts Amherst.
> Working under the guidance of Prof. Wayne Burleson, as part of my Master's
> research work, I have applied the theory of Markov Decision Process (MDP) to
> increase the duration of service of mobile computers. This semester I am
> involved with two course projects involving machine learning over large data
> sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank
> [5] to learn the dependence of side chain geometry on a protein's secondary
> structure, and comparing it with the Dynamic Bayesian Network approach used
> in [6]. In another project for the Online Social Networks class, I am using
> reinforcement learning to build an online recommendation system by
> reformulating MDP optimal policy search as an EM problem [7] and employing
> Map Reduce (extending Mahout) to arrive at it in a scalable, distributed
> manner.
> > I owe much to the open source community as all my research experiments
> have only been possible due to the freely available Linux distributions,
> performance analyzers, scripting languages and associated documentation.
> After joining the Apache Mahout's developer mailing list a few weeks ago,  I
> have found the community extremely vibrant, helpful and welcoming. If
> selected, I feel that the GSOC 2011 project will be a great learning
> experience for me from both a technical and professional standpoint and will
> also allow me to contribute within my modest means to the overall spirit of
> open source programming and Machine Learning.
> > References:
> > [1] A tutorial on hidden markov models and selected applications in
> speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol.
> 77 (1989), pp. 257-286.
> > [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K.
> Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun.
> In NIPS (2006), pp. 281-288.
> > [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
> Dyer. Morgan & Claypool 2010.
> > [4] http://flexcrfs.sourceforge.net/#Case_Study
> > [5] http://www.rcsb.org/pdb/home/home.do
> > [6] Beyond rotamers: a generative, probabilistic model of side chains in
> proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE,
> Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
> > [7] Probabilistic inference for solving discrete and continuous state
> Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>

Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

Reply via email to