Post a patch early so you can get feedback before it is too late to make use of it.
Sent from my iPad

On Aug 14, 2011, at 5:17 AM, Dhruv Kumar <[email protected]> wrote:

> Very good actually, just tying up some loose ends, and I will be putting up the patch with all the unit tests and the documentation soon!
>
> I just have to debug the slf4j issue, which I have discussed on the mailing list with Sean and Ted...
>
> I still need the rest of this week to tidy up the documentation a little bit.
>
> On Sun, Aug 14, 2011 at 7:25 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:
>
>> [ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084820#comment-13084820 ]
>>
>> Grant Ingersoll commented on MAHOUT-627:
>> ----------------------------------------
>>
>> Hey Dhruv, nearing pencils down, how are we doing?
>>
>>> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
>>> -----------------------------------------------------------------------------
>>>
>>> Key: MAHOUT-627
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-627
>>> Project: Mahout
>>> Issue Type: Task
>>> Components: Classification
>>> Affects Versions: 0.4, 0.5
>>> Reporter: Dhruv Kumar
>>> Assignee: Grant Ingersoll
>>> Labels: gsoc, gsoc2011, mahout-gsoc-11
>>> Fix For: 0.6
>>>
>>> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>>>
>>> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
>>> Student Name: Dhruv Kumar
>>> Student E-mail: [email protected]
>>> Organization/Project: Apache Mahout
>>> Assigned Mentor:
>>>
>>> Proposal Abstract:
>>>
>>> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximal Maximum Likelihood Estimator in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets.
>>>
>>> Detailed Description:
>>>
>>> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation, combined with their ability to discover latent domain knowledge, has made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging.
>>>
>>> An HMM is defined as a tuple (S, O, Theta), where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic transition matrix of the hidden states of size |S| x |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to a state j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose elements b_(s,o) provide the probability that a symbol o will be emitted from the hidden state s.
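To make the model definition above concrete (the initial-state vector Pi is described in the next quoted paragraph), here is a minimal plain-Java sketch of a parameter container. The class and field names are illustrative only and are not Mahout's actual HmmModel API:

    // Hypothetical container for the HMM parameters Theta = (A, B, Pi).
    // A sketch only; Mahout's real model class lives in
    // org.apache.mahout.classifier.sequencelearning.hmm.
    public class HmmParameters {
      private final double[][] transitionMatrix;     // A: |S| x |S|, a_(i,j) = P(next state j | state i)
      private final double[][] emissionMatrix;       // B: |S| x |O|, b_(s,o) = P(symbol o | state s)
      private final double[] initialProbabilities;   // Pi: length |S|, pi_(s) = P(start in state s)

      public HmmParameters(double[][] a, double[][] b, double[] pi) {
        this.transitionMatrix = a;
        this.emissionMatrix = b;
        this.initialProbabilities = pi;
      }

      public int numHiddenStates() { return transitionMatrix.length; }
      public int numObservedSymbols() { return emissionMatrix[0].length; }
    }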
>>> The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.
>>>
>>> Rabiner [1] defined three main problems for HMMs:
>>>
>>> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so-called Forward algorithm.
>>>
>>> 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.
>>>
>>> 3. Training: Given the set of hidden states S, the observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of fitting a model to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
>>>
>>> In general, the quality of HMM training can be improved by employing large training vectors, but currently Mahout only supports sequential versions of HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch training methods, the Baum-Welch algorithm is superior, more accurate, and a better candidate for a parallel implementation for two reasons:
>>>
>>> (1) BW is numerically stable and guarantees discovery of a locally maximal Maximum Likelihood Estimator (MLE) of the model's parameters Theta. In Viterbi training, the MLE is approximated in order to reduce computation time.
>>>
>>> (2) BW belongs to the general class of Expectation Maximization (EM) algorithms, which naturally fit into the Map-Reduce framework [2], like the existing Map-Reduce implementation of k-means in Mahout.
>>>
>>> Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of BW will use the sequential implementations of the Forward and Backward algorithms to compute the alpha and beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map-Reduce has been elaborated at great length in [3], which views it as a specific case of the Expectation-Maximization algorithm, and that treatment will be followed for the implementation in this project.
>>>
>>> The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration: Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i the model parameter tuple Theta^(i) is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1.
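As a rough illustration of the iteration scheme just described, the outer EM loop might be organized as below, reusing the HmmParameters sketch from earlier. runIteration() and hasConverged() are hypothetical placeholders for one MapReduce E+M pass and for the convergence test, not existing Mahout methods:

    // Sketch of the outer Baum-Welch EM loop: each pass maps Theta^(i) to Theta^(i+1).
    public class BaumWelchLoopSketch {
      public HmmParameters train(HmmParameters initial, int maxIterations, double epsilon) {
        HmmParameters theta = initial;                             // Theta^(0)
        for (int i = 0; i < maxIterations; i++) {
          HmmParameters next = runIteration(theta);                // one E + M MapReduce pass
          boolean converged = hasConverged(theta, next, epsilon);  // user-specified fixpoint test
          theta = next;                                            // Theta^(i+1) feeds iteration i+1
          if (converged) {
            break;                                                 // stop early on convergence
          }
        }
        return theta;                                              // final model parameters
      }

      // Placeholder for a single distributed Expectation + Maximization pass.
      private HmmParameters runIteration(HmmParameters current) {
        throw new UnsupportedOperationException("sketch only");
      }

      // Placeholder for the user-specified convergence (fixpoint) check.
      private boolean hasConverged(HmmParameters prev, HmmParameters next, double epsilon) {
        throw new UnsupportedOperationException("sketch only");
      }
    }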
>>> The iteration stops on a user-specified convergence condition expressed as a fixpoint, or when the number of iterations exceeds a user-defined value.
>>>
>>> Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish the Expectation step by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys, corresponding to the following three optimization problems to be solved during the Maximization, with their values in a hash-map:
>>>
>>> (1) Expected number of times a hidden state is reached (Pi).
>>> (2) Number of times each observable symbol is generated by each hidden state (B).
>>> (3) Number of transitions between each pair of states in the hidden state space (A).
>>>
>>> The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration. (A minimal mapper/reducer sketch of this structure is appended after the quoted message below.)
>>>
>>> The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner, with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is also an EM algorithm, particular attention will be paid to its code at each step for possible reuse.
>>>
>>> A list of milestones, associated deliverables and high-level implementation details is given below.
>>>
>>> Time-line: April 26 - Aug 15.
>>>
>>> Milestones:
>>>
>>> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.
>>>
>>> May 23 - June 3 (2 weeks): Work on the Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing code from the KMeansDriver class.
>>>
>>> June 3 - July 1 (4 weeks): Work on the Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() and partially complete the Expectation step.
>>>
>>> July 1 - July 15 (2 weeks): Work on the Reducer. Implement, test and document the class HmmReducer.
>>> The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse code from the HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.
>>>
>>> July 15 - July 29 (2 weeks): Work on the Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to the optimization problems for the Maximization stage in the reducers. Look at the possibility of code reuse from the KMeansCombiner class.
>>>
>>> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up the code, fix loose ends and finish the wiki documentation.
>>>
>>> Additional Information:
>>>
>>> I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side-chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project, for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map-Reduce (extending Mahout) to arrive at it in a scalable, distributed manner.
>>>
>>> I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning.
>>>
>>> References:
>>>
>>> [1] A tutorial on hidden Markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
>>> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.
>>> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool, 2010.
>>> [4] http://flexcrfs.sourceforge.net/#Case_Study
>>> [5] http://www.rcsb.org/pdb/home/home.do
>>> [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics, 2010 Jun 5.
>>> [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
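For reference, the mapper/reducer structure described in the proposal's Expectation and Maximization paragraphs above might look roughly as follows. This is a simplified variant that uses one reduce key per probability distribution (Pi, one row of A, or one row of B) so the reducer can normalize expected counts into relative frequencies directly; all class names are hypothetical, the forward-backward computation is left as a placeholder, and none of this is Mahout's actual code. In practice the two classes would live in separate source files.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Expectation: each mapper runs forward-backward on one observation sequence and
    // emits expected counts keyed by the distribution they belong to ("PI", "A|i", "B|s").
    public class BaumWelchMapperSketch extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      protected void map(LongWritable key, Text observedSequence, Context context)
          throws IOException, InterruptedException {
        // Placeholder: a real mapper would load Theta^(i) in setup() and call the sequential
        // forward and backward algorithms here to obtain posterior-weighted event counts.
        Map<Text, MapWritable> expectedCounts = computeExpectedCounts(observedSequence);
        for (Map.Entry<Text, MapWritable> entry : expectedCounts.entrySet()) {
          context.write(entry.getKey(), entry.getValue());
        }
      }

      private Map<Text, MapWritable> computeExpectedCounts(Text observedSequence) {
        return new HashMap<Text, MapWritable>(); // placeholder for the forward-backward pass
      }
    }

    // Maximization: each reducer sums the expected counts for one distribution and
    // normalizes them to relative frequencies, yielding that slice of Theta^(i+1).
    public class BaumWelchReducerSketch extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      protected void reduce(Text key, Iterable<MapWritable> partialCounts, Context context)
          throws IOException, InterruptedException {
        Map<Writable, Double> totals = new HashMap<Writable, Double>();
        double sum = 0.0;
        for (MapWritable partial : partialCounts) {
          for (Map.Entry<Writable, Writable> e : partial.entrySet()) {
            double count = ((DoubleWritable) e.getValue()).get();
            Double previous = totals.get(e.getKey());
            totals.put(e.getKey(), previous == null ? count : previous + count);
            sum += count;
          }
        }
        if (sum <= 0.0) {
          return; // nothing to normalize in this sketch
        }
        MapWritable updated = new MapWritable();
        for (Map.Entry<Writable, Double> e : totals.entrySet()) {
          updated.put(e.getKey(), new DoubleWritable(e.getValue() / sum)); // relative frequency
        }
        context.write(key, updated); // written to HDFS, read back by the next iteration's mappers
      }
    }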
