[jira] [Updated] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

Dhruv Kumar (JIRA) Wed, 06 Apr 2011 13:02:53 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dhruv Kumar updated MAHOUT-627:
-------------------------------

    Description: 
Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
Model Training. 

Student Name: Dhruv Kumar 

Student E-mail: [email protected] 

Organization/Project: Apache Mahout 

Assigned Mentor: 

Proposal Abstract: 

The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
because of its superior numerical stability and its ability to guarantee the 
discovery of a locally maximum,  Maximum Likelihood Estimator, in the presence 
of incomplete training data. Currently, Apache Mahout has a sequential 
implementation of the Baum-Welch which cannot be scaled to train over large 
data sets. This restriction reduces the quality of training and as an effect, 
constraints generalization of the learned model in production environments. 
This project proposes to extend Mahout's sequential implementation of the 
Baum-Welch to a parallel, distributed version using the Map-Reduce programming 
framework to allow training at a large scale for enhanced model fitting. 

Detailed Description: 

Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
for applications generating temporal or spatial sequential data. Relative 
simplicity of implementation, combined with their ability to discover latent 
domain knowledge have made them very popular in diverse fields such as DNA 
sequence alignment, gene discovery, handwriting analysis, voice recognition, 
computer vision, language translation and parts-of-speech tagging. 

A HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
unobservable, hidden states emitting symbols from a finite observable 
vocabulary set O according to a probabilistic model Theta. The parameters of 
the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) 
of A specify the probability of transitioning from a state i to state j. Matrix 
B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) 
provide the probability that a symbol o will be emitted from the hidden state 
s. The elements pi_(s) of the |S| length vector Pi determine the probability 
that the system starts in the hidden state s. The transitions of hidden states 
are unobservable and follow the Markov property of memorylessness. 

Rabiner [1] defined three main problems for HMMs: 

1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
observation sequence, determine the probability that the model generated the 
observed sequence. This is useful for evaluating the quality of the model and 
is solved using the so called Forward algorithm. 

2. Decoding: Given the complete model (S, O, Theta) and an observation 
sequence, determine the hidden state sequence which generated the observed 
sequence. This can be viewed as an inference problem where the model and 
observed sequence are used to predict the value of the unobservable random 
variables. The backward algorithm, also known as the Viterbi decoding algorithm 
is used for predicting the hidden state sequence. 

3. Training: Given the set of hidden states S, the set of observation 
vocabulary O and the observation sequence, determine the parameters (A, B, Pi) 
of the model Theta. This problem can be viewed as a statistical machine 
learning problem of model fitting to a large set of training data. The 
Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the 
Viterbi training algorithm are commonly used for model fitting. 

In general, the quality of HMM training can be improved by employing large 
training vectors but currently, Mahout only supports sequential versions of HMM 
trainers which are incapable of scaling.  Among the Viterbi and the Baum-Welch 
training methods, the Baum-Welch algorithm is superior, accurate, and a better 
candidate for a parallel implementation for two reasons:
(1) The BW is numerically stable and provides a guaranteed discovery of the 
locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters 
(Theta). In Viterbi training, the MLE is approximated in order to reduce 
computation time. 
(2) The BW belongs to the general class of Expectation Maximization (EM) 
algorithms which naturally fit into the Map-Reduce framework [2]. 

Hence, this project proposes to extend Mahout's current sequential 
implementation of the Baum-Welch HMM trainer to a scalable, distributed case. 
Since the distributed version of the BW will use the sequential implementations 
of the Forward and the Backward algorithms to compute the alpha and the beta 
factors in each iteration, a lot of existing HMM training code will be reused. 
Specifically, the parallel implementation of the BW algorithm on Map Reduce has 
been elaborated at great length in [3] by viewing it as a specific case of the 
Expectation-Maximization algorithm and will be followed for implementation in 
this project. 

The BW EM algorithm iteratively refines the model's parameters and consists of 
two distinct steps in each iteration--Expectation and Maximization. In the 
distributed case, the Expectation step is computed by the mappers and the 
reducers, while the Maximization is handled by the reducers. Starting from an 
initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is 
input to the algorithm, and the end result Theta^(i+1) is fed to the next 
iteration i+1. The iteration stops on a user specified convergence condition 
expressed as a fixpoint or when the number of iterations exceeds a user defined 
value. 

Expectation computes the posterior probability of each latent variable for each 
observed variable, weighed by the relative frequency of the observed variable 
in the input split. The mappers process independent training instances and emit 
expected state transition and emission counts using the forward-backward 
algorithm. The reducers finish Expectation by aggregating the expected counts. 
The input to a mapper consists of (k, v_o) pairs where k is a unique key and 
v_o is a string of observed symbols. For each training instance, the mappers 
emit the same set of keys corresponding to the following three optimization 
problems to be solved during the Maximization, and their values in a hash-map:
(1) Expected number of times a hidden state is reached (Pi).
(2) Number of times each observable symbol is generated by each hidden state 
(B).
(3) Number of transitions between each pair of states in the hidden state space 
(A). 

The M step computes the updated Theta(i+1) from the values generated during the 
E part. This involves aggregating the values (as hash-maps) for each key 
corresponding to one of the optimization problems. The aggregation summarizes 
the statistics necessary to compute a subset of the parameters for the next EM 
iteration. The optimal parameters for the next iteration are arrived by 
computing the relative frequency of each event with respect to its expected 
count at the current iteration. The emitted optimal parameters by each reducer 
are written to the HDFS and are fed to the mappers in the next iteration. 

The project can be subdivided into distinct tasks of programming, testing and 
documenting the driver, mapper, reducer and the combiner with the Expectation 
and Maximization parts split between them. For each of these tasks, a new class 
will be programmed, unit tested and documented within the 
org.apache.mahout.classifier.sequencelearning.hmm package. A list of 
milestones, associated deliverable and high level implementation details is 
given below. 

Time-line: April 26 - Aug 15. 

Milestones: 

April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my 
mentor, refine the project's plan and requirements, understand the community's 
code styling requirements, expand the knowledge on Hadoop and Mahout internals. 
Thoroughly familiarize with the classes within the 
classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math 
packages. 

May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the 
class HmmDriver by extending the AbstractJob class and by reusing the code from 
the KMeansDriver class. 

June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the 
class HmmMapper. The HmmMapper class will include setup() and map() methods. 
The setup() method will read in the HmmModel and the parameter values obtained 
from the previous iteration. The map() method will call the 
HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and 
complete the Expectation step partially. 

July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the 
class HmmReducer. The reducer will complete the Expectation step by summing 
over all the occurences emitted by the mappers for the three optimization 
problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if 
possible. Also, mid-term review.

July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the 
class HmmCombiner. The combiner will reduce the network traffic and improve 
efficiency by aggregating the values for each of the three keys corresponding 
to each of the optimization problems for the Maximization stage in reducers. 
Look at the possibility of code reuse from the KMeansCombiner class. 

July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, 
combiner and driver together. Give an example demonstrating the new parallel BW 
algorithm by employing the parts-of-speech tagger data set also used by the 
sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. 

Additional Information: 

I am in the final stages of finishing my Master's degree in Electrical and 
Computer Engineering from the University of Massachusetts Amherst. Working 
under the guidance of Prof. Wayne Burleson, as part of my Master's research 
work, I have applied the theory of Markov Decision Process (MDP) to increase 
the duration of service of mobile computers. This semester I am involved with 
two course projects involving machine learning over large data sets. In the 
Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the 
dependence of side chain geometry on a protein's secondary structure, and 
comparing it with the Dynamic Bayesian Network approach used in [6]. In another 
project for the Online Social Networks class, I am using reinforcement learning 
to build an online recommendation system by reformulating MDP optimal policy 
search as an EM problem [7] and employing Map Reduce (extending Mahout) to 
arrive at it in a scalable, distributed manner. 
I owe much to the open source community as all my research experiments have 
only been possible due to the freely available Linux distributions, performance 
analyzers, scripting languages and associated documentation. After joining the 
Apache Mahout's developer mailing list a few weeks ago,  I have found the 
community extremely vibrant, helpful and welcoming. If selected, I feel that 
the GSOC 2011 project will be a great learning experience for me from both a 
technical and professional standpoint and will also allow me to contribute 
within my modest means to the overall spirit of open source programming and 
Machine Learning. 

References: 

[1] A tutorial on hidden markov models and selected applications in speech 
recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), 
pp. 257-286. 

[2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, 
Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS 
(2006), pp. 281-288. 

[3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. 
Morgan & Claypool 2010. 

[4] http://flexcrfs.sourceforge.net/#Case_Study 

[5] http://www.rcsb.org/pdb/home/home.do

[6] Beyond rotamers: a generative, probabilistic model of side chains in 
proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, 
Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

[7] Probabilistic inference for solving discrete and continuous state Markov 
Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.

  was:
Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
Model Training. 

Student Name: Dhruv Kumar 

Student E-mail: [email protected] 

Organization/Project: Apache Mahout 

Assigned Mentor: 

Proposal Abstract: 

The Baum-Welch algorithm is commonly used for training a Hidden Markov Model as 
it is numerically stable and provides a guaranteed discovery of the Maximum 
Likelihood Estimator in the presence of incomplete data. Currently, Apache 
Mahout has a sequential implementation of the Baum-Welch which cannot be scaled 
to train over large data sets. This project proposes to extend the sequential 
implementation of the Baum-Welch to a parallel, distributed version using the 
Map Reduce programming framework to allow scalable Hidden Markov Model 
training. 

Detailed Description: 

Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
for applications generating temporal or spatial sequential data. Their relative 
simplicity of implementation and their ability to discover latent domain 
knowledge have made them very popular in fields such as DNA sequence alignment, 
handwriting analysis, voice recognition, computer vision and parts-of-speech 
tagging. 

A HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
unobservable, hidden states emitting symbols from a finite observable 
vocabulary set O according to a probabilistic model Theta. The parameters of 
the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) 
of A specify the probability of transitioning from a state i to state j. Matrix 
B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) 
provide the probability that a symbol o will be emitted from the hidden state 
s. The elements pi_(s) of the |S| length vector Pi determine the probability 
that the system starts in the hidden state s. The transitions of hidden states 
are unobservable and follow the Markov property of memorylessness. 

Rabiner [1] defined three main problems for HMMs: 

1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
observation sequence, determine the probability that the model generated the 
observed sequence. This is useful for determining the quality of the model and 
is solved using the so called Forward algorithm. 

2. Decoding: Given the complete model (S, O, Theta) and an observation 
sequence, determine the hidden state sequence which generated the observed 
sequence. This can be viewed as an inference problem where the model and 
observed sequence are used to predict the value of the unobservable random 
variables. The backward algorithm, also known as the Viterbi decoding algorithm 
is used for predicting the hidden state sequence. 

3. Training: Given the set of hidden states S, the set of observation 
vocabulary O and the observation sequence, determine the parameters (A, B, Pi) 
of the model Theta. This problem can be viewed as a statistical machine 
learning problem of model fitting to a large set of training data. The 
Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the 
Viterbi training algorithm are commonly used for model fitting. 

In general, the quality of HMM training can be improved by employing large 
training vectors but currently, Mahout only supports sequential versions of HMM 
trainers which are incapable of scaling.  Among the Viterbi and the Baum-Welch 
training methods, the Baum-Welch algorithm is slower but more accurate, and a 
better candidate for a parallel implementation for two reasons:
(1) The BW is more numerically stable and provides a guaranteed local maximum 
of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In 
Viterbi training, the MLE is approximated in order to reduce computation time. 
(2) The BW belongs to the general class of Expectation Maximization (EM) 
algorithms which naturally fit into the Map Reduce framework [2]. 

Hence, this project proposes to extend Mahout's current sequential 
implementation of the Baum-Welch HMM trainer to a scalable, distributed case. 
Since the distributed version of the BW will use the sequential implementations 
of the Forward and the Backward algorithms to compute the alpha and the beta 
factors in each iteration, a lot of existing HMM training code will be reused. 
Specifically, the parallel implementation of the BW algorithm on Map Reduce has 
been elaborated at great length in [3] by viewing it as a specific case of the 
Expectation-Maximization algorithm and will be followed for implementation in 
this project. 

The BW EM algorithm iteratively refines the model's parameters and consists of 
two distinct steps in each iteration--Expectation and Maximization. In the 
distributed case, the Expectation step is computed by the mappers and the 
reducers, while the Maximization is handled by the reducers. Starting from an 
initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is 
input to the algorithm, and the end result Theta^(i+1) is fed to the next 
iteration i+1. The iteration stops on a user specified convergence condition 
expressed as a fixpoint or when the number of iterations exceeds a user defined 
value. 

Expectation computes the posterior probability of each latent variable for each 
observed variable, weighed by the relative frequency of the observed variable 
in the input split. The mappers process independent training instances and emit 
expected state transition and emission counts using the forward-backward 
algorithm. The reducers finish Expectation by aggregating the expected counts. 
The input to a mapper consists of (k, v_o) pairs where k is a unique key and 
v_o is a string of observed symbols. For each training instance, the mappers 
emit the same set of keys corresponding to the following three optimization 
problems to be solved during the Maximization, and their values in a hash-map:
(1) Expected number of times a hidden state is reached (Pi).
(2) Number of times each observable symbol is generated by each hidden state 
(B).
(3) Number of transitions between each pair of states in the hidden state space 
(A). 

The M step computes the updated Theta(i+1) from the values generated during the 
E part. This involves aggregating the values (as hash-maps) for each key 
corresponding to one of the optimization problems. The aggregation summarizes 
the statistics necessary to compute a subset of the parameters for the next EM 
iteration. The optimal parameters for the next iteration are arrived by 
computing the relative frequency of each event with respect to its expected 
count at the current iteration. The emitted optimal parameters by each reducer 
are written to the HDFS and are fed to the mappers in the next iteration. 

The project can be subdivided into distinct tasks of programming, testing and 
documenting the driver, mapper, reducer and the combiner with the Expectation 
and Maximization parts split between them. For each of these tasks, a new class 
will be programmed, unit tested and documented within the 
org.apache.mahout.classifier.sequencelearning.hmm package. A list of 
milestones, associated deliverable and high level implementation details is 
given below. 

Time-line: April 26 - Aug 15. 

Milestones: 

April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my 
mentor, refine the project's plan and requirements, understand the community's 
code styling requirements, expand the knowledge on Hadoop and Mahout internals. 
Thoroughly familiarize with the classes within the 
classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math 
packages. 

May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the 
class HmmDriver by extending the AbstractJob class and by reusing the code from 
the KMeansDriver class. 

June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the 
class HmmMapper. The HmmMapper class will include setup() and map() methods. 
The setup() method will read in the HmmModel and the parameter values obtained 
from the previous iteration. The map() method will call the 
HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and 
complete the Expectation step partially. 

July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the 
class HmmReducer. The reducer will complete the Expectation step by summing 
over all the occurences emitted by the mappers for the three optimization 
problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if 
possible. Also, mid-term review.

July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the 
class HmmCombiner. The combiner will reduce the network traffic and improve 
efficiency by aggregating the values for each of the three keys corresponding 
to each of the optimization problems for the Maximization stage in reducers. 
Look at the possibility of code reuse from the KMeansCombiner class. 

July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, 
combiner and driver together. Give an example demonstrating the new parallel BW 
algorithm by employing the parts-of-speech tagger data set also used by the 
sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. 

Additional Information: 

I am in the final stages of finishing my Master's degree in Electrical and 
Computer Engineering from the University of Massachusetts Amherst. Working 
under the guidance of Prof. Wayne Burleson, as part of my Master's research 
work, I have applied the theory of Markov Decision Process (MDP) to increase 
the duration of service of mobile computers. This semester I am involved with 
two course projects involving machine learning over large data sets. In the 
Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the 
dependence of side chain geometry on a protein's secondary structure, and 
comparing it with the Dynamic Bayesian Network approach used in [6]. In another 
project for the Online Social Networks class, I am using reinforcement learning 
to build an online recommendation system by reformulating MDP optimal policy 
search as an EM problem [7] and employing Map Reduce (extending Mahout) to 
arrive at it in a scalable, distributed manner. 
I owe much to the open source community as all my research experiments have 
only been possible due to the freely available Linux distributions, performance 
analyzers, scripting languages and documentation. Since joining the Apache dev 
mailing list,  I have found the Apache Mahout's developer community vibrant, 
helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a 
great learning experience for me from both a technical and professional 
standpoint and also allow me to contribute within my modest means to the 
overall spirit of open source programming. 

References: 

[1] A tutorial on hidden markov models and selected applications in speech 
recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), 
pp. 257-286. 

[2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, 
Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS 
(2006), pp. 281-288. 

[3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. 
Morgan & Claypool 2010. 

[4] http://flexcrfs.sourceforge.net/#Case_Study 

[5] http://www.rcsb.org/pdb/home/home.do

[6] Beyond rotamers: a generative, probabilistic model of side chains in 
proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, 
Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

[7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete 
and continuous state Markov Decision
Processes. In ICML, 2006.


> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-627
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Dhruv Kumar
>              Labels: gsoc, gsoc2011, mahout-gsoc-11
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: [email protected] 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its ability to guarantee the 
> discovery of a locally maximum,  Maximum Likelihood Estimator, in the 
> presence of incomplete training data. Currently, Apache Mahout has a 
> sequential implementation of the Baum-Welch which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and as 
> an effect, constraints generalization of the learned model in production 
> environments. This project proposes to extend Mahout's sequential 
> implementation of the Baum-Welch to a parallel, distributed version using the 
> Map-Reduce programming framework to allow training at a large scale for 
> enhanced model fitting. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Relative 
> simplicity of implementation, combined with their ability to discover latent 
> domain knowledge have made them very popular in diverse fields such as DNA 
> sequence alignment, gene discovery, handwriting analysis, voice recognition, 
> computer vision, language translation and parts-of-speech tagging. 
> A HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
> observation sequence, determine the probability that the model generated the 
> observed sequence. This is useful for evaluating the quality of the model and 
> is solved using the so called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The backward algorithm, also known as the Viterbi decoding 
> algorithm is used for predicting the hidden state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
> In general, the quality of HMM training can be improved by employing large 
> training vectors but currently, Mahout only supports sequential versions of 
> HMM trainers which are incapable of scaling.  Among the Viterbi and the 
> Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, 
> and a better candidate for a parallel implementation for two reasons:
> (1) The BW is numerically stable and provides a guaranteed discovery of the 
> locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters 
> (Theta). In Viterbi training, the MLE is approximated in order to reduce 
> computation time. 
> (2) The BW belongs to the general class of Expectation Maximization (EM) 
> algorithms which naturally fit into the Map-Reduce framework [2]. 
> Hence, this project proposes to extend Mahout's current sequential 
> implementation of the Baum-Welch HMM trainer to a scalable, distributed case. 
> Since the distributed version of the BW will use the sequential 
> implementations of the Forward and the Backward algorithms to compute the 
> alpha and the beta factors in each iteration, a lot of existing HMM training 
> code will be reused. Specifically, the parallel implementation of the BW 
> algorithm on Map Reduce has been elaborated at great length in [3] by viewing 
> it as a specific case of the Expectation-Maximization algorithm and will be 
> followed for implementation in this project. 
> The BW EM algorithm iteratively refines the model's parameters and consists 
> of two distinct steps in each iteration--Expectation and Maximization. In the 
> distributed case, the Expectation step is computed by the mappers and the 
> reducers, while the Maximization is handled by the reducers. Starting from an 
> initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is 
> input to the algorithm, and the end result Theta^(i+1) is fed to the next 
> iteration i+1. The iteration stops on a user specified convergence condition 
> expressed as a fixpoint or when the number of iterations exceeds a user 
> defined value. 
> Expectation computes the posterior probability of each latent variable for 
> each observed variable, weighed by the relative frequency of the observed 
> variable in the input split. The mappers process independent training 
> instances and emit expected state transition and emission counts using the 
> forward-backward algorithm. The reducers finish Expectation by aggregating 
> the expected counts. The input to a mapper consists of (k, v_o) pairs where k 
> is a unique key and v_o is a string of observed symbols. For each training 
> instance, the mappers emit the same set of keys corresponding to the 
> following three optimization problems to be solved during the Maximization, 
> and their values in a hash-map:
> (1) Expected number of times a hidden state is reached (Pi).
> (2) Number of times each observable symbol is generated by each hidden state 
> (B).
> (3) Number of transitions between each pair of states in the hidden state 
> space (A). 
> The M step computes the updated Theta(i+1) from the values generated during 
> the E part. This involves aggregating the values (as hash-maps) for each key 
> corresponding to one of the optimization problems. The aggregation summarizes 
> the statistics necessary to compute a subset of the parameters for the next 
> EM iteration. The optimal parameters for the next iteration are arrived by 
> computing the relative frequency of each event with respect to its expected 
> count at the current iteration. The emitted optimal parameters by each 
> reducer are written to the HDFS and are fed to the mappers in the next 
> iteration. 
> The project can be subdivided into distinct tasks of programming, testing and 
> documenting the driver, mapper, reducer and the combiner with the Expectation 
> and Maximization parts split between them. For each of these tasks, a new 
> class will be programmed, unit tested and documented within the 
> org.apache.mahout.classifier.sequencelearning.hmm package. A list of 
> milestones, associated deliverable and high level implementation details is 
> given below. 
> Time-line: April 26 - Aug 15. 
> Milestones: 
> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my 
> mentor, refine the project's plan and requirements, understand the 
> community's code styling requirements, expand the knowledge on Hadoop and 
> Mahout internals. Thoroughly familiarize with the classes within the 
> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and 
> math packages. 
> May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the 
> class HmmDriver by extending the AbstractJob class and by reusing the code 
> from the KMeansDriver class. 
> June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the 
> class HmmMapper. The HmmMapper class will include setup() and map() methods. 
> The setup() method will read in the HmmModel and the parameter values 
> obtained from the previous iteration. The map() method will call the 
> HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() 
> and complete the Expectation step partially. 
> July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the 
> class HmmReducer. The reducer will complete the Expectation step by summing 
> over all the occurences emitted by the mappers for the three optimization 
> problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if 
> possible. Also, mid-term review.
> July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document 
> the class HmmCombiner. The combiner will reduce the network traffic and 
> improve efficiency by aggregating the values for each of the three keys 
> corresponding to each of the optimization problems for the Maximization stage 
> in reducers. Look at the possibility of code reuse from the KMeansCombiner 
> class. 
> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, 
> combiner and driver together. Give an example demonstrating the new parallel 
> BW algorithm by employing the parts-of-speech tagger data set also used by 
> the sequential BW [4]. Tidy up code and fix loose ends, finish wiki 
> documentation. 
> Additional Information: 
> I am in the final stages of finishing my Master's degree in Electrical and 
> Computer Engineering from the University of Massachusetts Amherst. Working 
> under the guidance of Prof. Wayne Burleson, as part of my Master's research 
> work, I have applied the theory of Markov Decision Process (MDP) to increase 
> the duration of service of mobile computers. This semester I am involved with 
> two course projects involving machine learning over large data sets. In the 
> Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the 
> dependence of side chain geometry on a protein's secondary structure, and 
> comparing it with the Dynamic Bayesian Network approach used in [6]. In 
> another project for the Online Social Networks class, I am using 
> reinforcement learning to build an online recommendation system by 
> reformulating MDP optimal policy search as an EM problem [7] and employing 
> Map Reduce (extending Mahout) to arrive at it in a scalable, distributed 
> manner. 
> I owe much to the open source community as all my research experiments have 
> only been possible due to the freely available Linux distributions, 
> performance analyzers, scripting languages and associated documentation. 
> After joining the Apache Mahout's developer mailing list a few weeks ago,  I 
> have found the community extremely vibrant, helpful and welcoming. If 
> selected, I feel that the GSOC 2011 project will be a great learning 
> experience for me from both a technical and professional standpoint and will 
> also allow me to contribute within my modest means to the overall spirit of 
> open source programming and Machine Learning. 
> References: 
> [1] A tutorial on hidden markov models and selected applications in speech 
> recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 
> (1989), pp. 257-286. 
> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. 
> Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. 
> In NIPS (2006), pp. 281-288. 
> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. 
> Morgan & Claypool 2010. 
> [4] http://flexcrfs.sourceforge.net/#Case_Study 
> [5] http://www.rcsb.org/pdb/home/home.do
> [6] Beyond rotamers: a generative, probabilistic model of side chains in 
> proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, 
> Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
> [7] Probabilistic inference for solving discrete and continuous state Markov 
> Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

Reply via email to