On Wed, Apr 6, 2011 at 9:42 AM, Grant Ingersoll <[email protected]> wrote:
> I think this looks really good.
>

Thanks! I'll be polishing it more today on MAHOUT-627, documenting any changes in the comments section, and I will correspondingly update the proposal on the GSoC site.

Just a few quick questions:

1. Is it better to start by writing the driver first, followed by the mapper, combiner and reducer, or to write the driver at the end? Is it just a matter of personal style (top-down vs. bottom-up development), or do these approaches have tangible pros and cons in the context of iterative MapReduce programming?

2. Is there a preferred IDE (Eclipse, IntelliJ or ...)? I've been using Eclipse so far.

3. Is there a place where I can find guidelines for code formatting specific to Mahout? Things like indentation, class names, comments, etc.

Dhruv

>
> On Apr 4, 2011, at 6:34 PM, Dhruv Kumar wrote:
>
> > Here's a first draft of my proposal. It would be great if I could get any feedback before I submit it to GSoC and update MAHOUT-627.
> >
> > Thank you.
> >
> > Dhruv
> >
> > --------- Proposal Begin ---------
> >
> > Proposal Title: Implementation of the Baum-Welch Algorithm for parallel Hidden Markov Model training on MapReduce.
> >
> > Student Name: Dhruv Kumar
> >
> > Student E-mail: [email protected]
> >
> > Organization/Project: Apache Mahout
> >
> > Assigned Mentor:
> >
> > Proposal Abstract:
> >
> > The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot scale to train over large data sets. This project proposes to extend the sequential implementation of Baum-Welch to a parallel, distributed version using the MapReduce programming framework, allowing scalable Hidden Markov Model training.
> >
> > Detailed Description:
> >
> > Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.
> >
> > An HMM is defined as a tuple (S, O, Theta), where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic transition matrix of the hidden states of size |S| x |S|. The elements a_(i,j) of A specify the probability of transitioning from state i to state j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose elements b_(s,o) give the probability that symbol o will be emitted from hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.
> >
> > Rabiner [1] defined three main problems for HMMs:
> >
> > 1. Evaluation: Given the complete model (S, O, Theta) and an observation sequence, determine the probability that the model generated the observed sequence.
> > This is useful for determining the quality of the model and is solved using the so-called Forward algorithm.
> >
> > 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and the observed sequence are used to predict the values of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.
> >
> > 3. Training: Given the set of hidden states S, the observation vocabulary O and an observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of fitting a model to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
> >
> > Since most problems modeled by HMMs have a modest number of hidden and observable states, the sequential versions of the Forward and the Viterbi algorithms (currently implemented in Mahout) are sufficient for evaluation and decoding. However, the training data is often so large that a single compute node cannot handle it. Attempts to prune the training data can lead to lower-quality model fitting and may miss crucial domain knowledge because of inferior posterior probabilities. Currently, Mahout only supports the sequential HMM trainers (Viterbi and Baum-Welch), which are incapable of scaling to large data sets.
> >
> > This project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. The distributed version of BW will use the sequential implementations of the Forward and the Backward algorithms in each iteration, so a lot of existing HMM training code will be reused.
> >
> > Of the two sequential HMM training methods, the Baum-Welch (BW) or Forward-Backward algorithm is the better candidate for a parallel implementation, for two reasons. (1) BW is more numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator (MLE), albeit a local maximum, unlike Viterbi training where the MLE is approximated in order to reduce computation time. (2) BW belongs to the general class of Expectation Maximization (EM) algorithms, which naturally fit the MapReduce framework [2]. Specifically, the parallel implementation of the BW algorithm on MapReduce has been elaborated at great length in [3].
> >
> > The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration: Expectation (E) and Maximization (M). In the distributed case, the E step is computed by the mappers and the reducers, while the M step is handled by the reducers. Starting from an initial Theta(0), in each iteration i the model parameter tuple Theta(i) is input to the algorithm, and the resulting Theta(i+1) is fed to the next iteration i+1. The iterations stop on a user-specified convergence condition expressed as a fixed point, or when the number of iterations exceeds a user-defined value.
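> >
> > Before turning to the MapReduce formulation of the E and M steps, the following small, self-contained sketch makes the model tuple (A, B, Pi) and the forward pass used during the E step concrete. It is illustrative only: it does not use Mahout's existing HmmModel or HmmAlgorithms classes, all names are provisional, and it omits the scaling (or log-space arithmetic) that a real implementation needs for numerical stability.
> >
> > // Illustrative sketch only; provisional names, unrelated to Mahout's existing HMM classes.
> > public class HmmSketch {
> >   private final double[][] a;  // |S| x |S| transition matrix, a[i][j] = P(state j | state i)
> >   private final double[][] b;  // |S| x |O| emission matrix, b[s][o] = P(symbol o | state s)
> >   private final double[] pi;   // length-|S| initial state distribution
> >
> >   public HmmSketch(double[][] a, double[][] b, double[] pi) {
> >     this.a = a;
> >     this.b = b;
> >     this.pi = pi;
> >   }
> >
> >   // Forward recursion: alpha[t][s] = P(o_1 ... o_t, state at time t = s | Theta).
> >   public double[][] forward(int[] obs) {
> >     int numStates = pi.length;
> >     double[][] alpha = new double[obs.length][numStates];
> >     for (int s = 0; s < numStates; s++) {
> >       alpha[0][s] = pi[s] * b[s][obs[0]];
> >     }
> >     for (int t = 1; t < obs.length; t++) {
> >       for (int j = 0; j < numStates; j++) {
> >         double sum = 0.0;
> >         for (int i = 0; i < numStates; i++) {
> >           sum += alpha[t - 1][i] * a[i][j];
> >         }
> >         alpha[t][j] = sum * b[j][obs[t]];
> >       }
> >     }
> >     return alpha;
> >   }
> >
> >   // Evaluation (problem 1): likelihood of the observation sequence under Theta.
> >   public double likelihood(int[] obs) {
> >     double[][] alpha = forward(obs);
> >     double p = 0.0;
> >     for (double v : alpha[obs.length - 1]) {
> >       p += v;
> >     }
> >     return p;
> >   }
> > }
> >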
> > The E step computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish the E step by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols of length n. For each training instance, the mappers emit the same set of keys, corresponding to the following three optimization problems to be solved during the M step: (1) the expected number of times a hidden state is reached, (2) the expected number of times each observable symbol is generated by each hidden state, and (3) the expected number of transitions between each pair of states in the hidden state space.
> >
> > The M step computes the updated Theta(i+1) from the values generated during the E step. This involves aggregating, for each key, the values obtained in the E step for the corresponding optimization problem. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.
> >
> > The project can be subdivided into the distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner. A list of milestones with their associated deliverables is given below.
> >
> > Timeline: April 26 - Aug 15.
> >
> > Milestones:
> >
> > April 26 - May 22 (4 weeks): Pre-coding tasks. Open communication with my mentor, refine the project's plan and requirements, understand the code styling requirements, and deepen my knowledge of Hadoop's and Mahout's internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.
> >
> > May 23 - June 3 (2 weeks): Driver. Implement and test the class HmmDriver by extending the AbstractJob class and reusing code from the KMeansDriver class.
> >
> > June 3 - July 1 (4 weeks): Mapper. Implement and test the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() to partially compute the E step.
> >
> > July 1 - July 15 (2 weeks): Reducer. Implement and test the class HmmReducer. The reducer will complete the E step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse code from the HmmTrainer.trainBaumWelch() method if possible.
> >
> > July 15 - July 29 (2 weeks): Combiner. Implement and test the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to the optimization problems for the M stage. Look at the possibility of code reuse from KMeansCombiner. A rough sketch of how these three classes might fit together is given below.
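> >
> > The skeletons below are only an illustration of the intended structure (provisional class names and an assumed key layout, not existing Mahout code): one MapReduce job corresponds to one EM iteration, the mappers compute partial expected counts per training sequence, the combiner sums them locally, and the reducer finishes the E step before performing the M step.
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.io.DoubleWritable;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Mapper;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class BaumWelchSketch {
> >
> >   // E step (partial): one observation sequence per call to map().
> >   public static class HmmMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
> >     // setup() would deserialize Theta(i), i.e. the model produced by the previous
> >     // iteration, from a location placed in the Configuration by the driver.
> >
> >     @Override
> >     protected void map(LongWritable key, Text value, Context context)
> >         throws IOException, InterruptedException {
> >       // Assume one training sequence per line: whitespace-delimited integer symbol ids.
> >       String[] tokens = value.toString().trim().split("\\s+");
> >       int[] obs = new int[tokens.length];
> >       for (int i = 0; i < tokens.length; i++) {
> >         obs[i] = Integer.parseInt(tokens[i]);
> >       }
> >       // Run the forward and backward passes over obs (reusing the existing sequential
> >       // HmmAlgorithms code) to obtain the posteriors, then emit expected counts keyed
> >       // by the three M-step problems, for example:
> >       //   "initial,<state>"               expected times the chain starts in <state>
> >       //   "emit,<state>,<symbol>"         expected emissions of <symbol> from <state>
> >       //   "transit,<state_i>,<state_j>"   expected transitions from state_i to state_j
> >       context.write(new Text("initial,0"), new DoubleWritable(0.0)); // placeholder value
> >     }
> >   }
> >
> >   // Combiner: purely local aggregation of partial counts to cut shuffle traffic.
> >   public static class HmmCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
> >     @Override
> >     protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
> >         throws IOException, InterruptedException {
> >       double sum = 0.0;
> >       for (DoubleWritable v : values) {
> >         sum += v.get();
> >       }
> >       context.write(key, new DoubleWritable(sum));
> >     }
> >   }
> >
> >   // Reducer: finishes the E step by summing the counts for a key. The M step
> >   // (normalizing each group of counts into relative frequencies and writing
> >   // Theta(i+1) to HDFS) is omitted here; a real key design must group all counts
> >   // of one distribution, e.g. one row of A, into a single reduce call so that
> >   // the row can be normalized.
> >   public static class HmmReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
> >     @Override
> >     protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
> >         throws IOException, InterruptedException {
> >       double sum = 0.0;
> >       for (DoubleWritable v : values) {
> >         sum += v.get();
> >       }
> >       context.write(key, new DoubleWritable(sum));
> >     }
> >   }
> > }
> >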
> > July 29 - August 15 (2 weeks): Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW. Tidy up the code and fix loose ends, conduct final tests, and finish any remaining documentation.
> >
> > Additional Information:
> >
> > I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. As part of my Master's research work, under the guidance of Prof. Wayne Burleson, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved in two course projects on machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side-chain geometry on a protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system by using MDP theory and recasting the problem as an instance of sequential optimal control.
> >
> > I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, open source performance analyzers, scripting languages such as Python, and the suite of GNU tools. I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will allow me to contribute within my modest means to the overall spirit of open coding.
> >
> > References:
> >
> > [1] A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
> >
> > [2] Map-Reduce for Machine Learning on Multicore, by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng and Kunle Olukotun. In NIPS (2006), pp. 281-288.
> >
> > [3] Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer. Morgan & Claypool, 2010.
> >
> > --------- Proposal End -------------
> >
> > On Thu, Mar 24, 2011 at 8:08 PM, Ted Dunning <[email protected]> wrote:
> >
> >> I would invert the order of 1 and 2 and estimate the times as 30, 30, 40. You can view this from a few points of view:
> >>
> >> a) writing good tests is hard
> >>
> >> b) writing good tests makes writing code much easier
> >>
> >> c) writing any kind of decent documentation is VASTLY harder than most people think. It is also very important for user adoption, mentor satisfaction and personal marketing (for instance a resume or portfolio)
> >>
> >> On Thu, Mar 24, 2011 at 4:41 PM, Dhruv Kumar <[email protected]> wrote:
> >>
> >>> Thanks Ted, I'll start working on a proposal with the following sub-tasks (I have given rudimentary percentage time estimates, please feel free to suggest alterations):
> >>>
> >>> 1. Implementing BW on MapReduce along the lines of k-means. Focus on re-using as much of the existing k-means code as possible. (60%)
> >>>
> >>> 2. Unit testing the Mapper, Combiner and Reducer, and testing the integration in local and pseudo-distributed modes.
> >>> I may be able to get access to a small cluster at UMass for testing in fully distributed mode. (35%)
> >>>
> >>> 3. Writing clear documentation directing clients how to use the implemented library code for their needs. (5%)
> >>>
> >>> On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <[email protected]> wrote:
> >>>>
> >>>>> 2. Another very interesting possibility is to express the BW as a recursive join. There's a very interesting offshoot of Hadoop, called HaLoop (http://code.google.com/p/haloop/), which supports loop control and caching of the intermediate results on the mapper inputs, reducer inputs and reducer outputs to improve performance. The paper [1] describes this in more detail. They have implemented k-means as a recursive join.
> >>>>
> >>>> Until there is flexibility around the execution model, such as the recent MapReduce 2.0 announcement from Yahoo, and until that flexibility is pretty much standard, it is hard to justify this. The exception is where such extended capabilities fit into standard Hadoop 0.20 environments.
> >>>>
> >>>>> In either case, I want to clearly define the scope and task list. BW will be the core of the project, but:
> >>>>>
> >>>>> 1. Does it make sense to implement the "counting method" for model discovery as well? It is clearly inferior, but would it be a good reference for comparison with BW? Any added benefit?
> >>>>
> >>>> No opinion here except that increased scope decreases the probability of even partial success.
> >>>>
> >>>>> 2. What has been the standard in past GSoC Mahout projects regarding unit testing and documentation?
> >>>>
> >>>> Do it.
> >>>>
> >>>> Seriously.
> >>>>
> >>>> We use JUnit 4+ and very much prefer strong unit tests. Nothing in what you are proposing should require anything interesting in this regard. Testing the mapper, combiner and reducer in isolation is good. Testing the integrated program in local mode or pseudo-distributed mode should suffice beyond that. It is best if you can separate command line argument parsing from the execution path so that you can test them separately.
> >>>>
> >>>>> In the meantime, I've been understanding more about Mahout, MapReduce and Hadoop's internals. One of my course projects this semester is to implement the Bellman Iteration algorithm on MapReduce, and so far it has been coming along well.
> >>>>>
> >>>>> Any feedback is much appreciated.
> >>>>>
> >>>>> Dhruv
>
> --------------------------
> Grant Ingersoll
> Lucene Revolution -- Lucene and Solr User Conference
> May 25-26 in San Francisco
> www.lucenerevolution.org
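
P.S. On Ted's earlier point about unit testing and about keeping the core computation separate from the Hadoop plumbing: the snippet below is only meant to show the flavor of JUnit 4 test I have in mind. The normalizeRow() helper is a hypothetical stand-in for the M-step normalization, not existing Mahout code.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class MStepNormalizationTest {

  // Hypothetical stand-in for the M-step helper: turns a row of expected
  // counts into relative frequencies, i.e. one stochastic row of A or B.
  static double[] normalizeRow(double[] expectedCounts) {
    double total = 0.0;
    for (double c : expectedCounts) {
      total += c;
    }
    double[] row = new double[expectedCounts.length];
    for (int i = 0; i < expectedCounts.length; i++) {
      row[i] = expectedCounts[i] / total;
    }
    return row;
  }

  @Test
  public void rowBecomesAProbabilityDistribution() {
    double[] row = normalizeRow(new double[] {2.0, 6.0, 2.0});
    assertEquals(0.2, row[0], 1e-9);
    assertEquals(0.6, row[1], 1e-9);
    assertEquals(0.2, row[2], 1e-9);
  }
}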
