Regarding "1. 2 weeks. Prepare a list of all reasonable parallelization strategies": do you already have some candidate strategies in mind? To give the discussion a starting point, two rough sketches follow the quoted proposal below: the standard sequential recurrence, and one possible per-sequence Map-Reduce split.
Thanks.
-D

On Wed, Apr 6, 2011 at 1:44 PM, Sergey Bartunov (JIRA) <[email protected]> wrote:

> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Sergey Bartunov
>
> Proposal Title: Parallel Viterbi algorithm for HMM
>
> Student Name: Sergey Bartunov
>
> Student E-mail: [email protected]
>
> Organization/Project: Apache Mahout
>
> Assigned Mentor:
>
> Proposal Abstract:
>
> The Viterbi algorithm is a decoding algorithm for Hidden Markov Models
> [1]. It finds the most likely sequence of hidden states, called the
> "Viterbi path", for a given sequence of observed states. The Viterbi
> algorithm is also used in Viterbi training, which is less numerically
> stable than the Baum-Welch algorithm but faster, and which can be
> adjusted to reach nearly the same accuracy [2]. Hidden Markov Models
> are widely used in speech and gesture recognition, part-of-speech
> tagging, bioinformatics, etc. At the moment Apache Mahout contains
> only sequential HMM functionality, and this project is intended to
> extend it by implementing a Map-Reduce version of the Viterbi
> algorithm, which would let Mahout decode HMMs over large amounts of
> data in parallel.
>
> Detailed Description:
>
> The Viterbi algorithm is a fairly common dynamic-programming algorithm
> and is well described in many sources, such as [3], [4], [5]. Being
> popular and widely needed, it already has parallel versions that have
> been the subject of several studies, for example a paper on a GPU
> implementation [6]. Some of these papers may contain useful ideas for
> a Map-Reduce implementation.
>
> There are several possible strategies for parallelization (which one
> is best is an open question for the project), and I'll discuss them on
> the mailing list.
>
> [Probably at least one strategy will be placed here later]
>
> Timeline:
> 1. 2 weeks. Prepare a list of all reasonable parallelization
> strategies, possibly drawing ideas from the papers, then analyze them
> and discuss them with the mentor/community. Explore Mahout internals
> and best practices. This stage may shrink if parallel discussions take
> place on the mailing list before GSoC actually begins.
> 2. 4 weeks. Select the most suitable strategy and implement it in
> Mahout as a Job. The implementation should work (or be compatible)
> with org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
> 3. 2 weeks. Collect a debugging dataset, write tests for the code,
> find possible problems, and fix them.
> 4. 2 weeks. Try to obtain a large dataset to test performance,
> scalability, etc., and write and run such a test (community help might
> be needed here). Discuss and analyze any problems, then fix them.
> 5. 1 week. Write documentation for all classes, a description of the
> algorithm in the wiki, and a complete usage example.
>
> Additional Information:
>
> The project relates to [this
> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is
> about implementing a parallel Baum-Welch training algorithm for HMM.
>
> One important detail is that I have university exams until 16 June.
> After that I will be ready to work on the project.
>
> References:
>
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov
> Models and selected applications in speech
> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> [2] [Adjusted Viterbi Training. A proof of
> concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
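First sketch: the sequential recurrence the proposal builds on, computed in log space for numerical stability. This is a minimal textbook version, deliberately using plain Java arrays rather than Mahout's HmmModel/Matrix types; all names in it (ViterbiSketch, pi, a, b) are mine, not Mahout's.

public final class ViterbiSketch {

  /**
   * Textbook Viterbi in log space.
   *
   * @param pi  log initial probabilities, pi[i] = log P(state_0 = i)
   * @param a   log transition matrix, a[i][j] = log P(j at t+1 | i at t)
   * @param b   log emission matrix, b[i][k] = log P(observe k | state i)
   * @param obs observed symbol indices, assumed non-empty
   * @return the most likely hidden-state sequence (the Viterbi path)
   */
  public static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
    int n = pi.length;                    // number of hidden states
    int t = obs.length;                   // sequence length
    double[][] delta = new double[t][n];  // best log-probability ending in each state
    int[][] psi = new int[t][n];          // back-pointers

    // Initialization: delta[0][i] = log pi_i + log b_i(o_0)
    for (int i = 0; i < n; i++) {
      delta[0][i] = pi[i] + b[i][obs[0]];
    }
    // Recursion: delta[s][j] = max_i(delta[s-1][i] + log a_ij) + log b_j(o_s)
    for (int s = 1; s < t; s++) {
      for (int j = 0; j < n; j++) {
        int best = 0;
        double bestScore = delta[s - 1][0] + a[0][j];
        for (int i = 1; i < n; i++) {
          double score = delta[s - 1][i] + a[i][j];
          if (score > bestScore) {
            bestScore = score;
            best = i;
          }
        }
        delta[s][j] = bestScore + b[j][obs[s]];
        psi[s][j] = best;
      }
    }
    // Termination: pick the best final state, then follow back-pointers.
    int last = 0;
    for (int i = 1; i < n; i++) {
      if (delta[t - 1][i] > delta[t - 1][last]) {
        last = i;
      }
    }
    int[] path = new int[t];
    path[t - 1] = last;
    for (int s = t - 1; s > 0; s--) {
      path[s - 1] = psi[s][path[s]];
    }
    return path;
  }
}

Note the asymmetry this exposes: the recurrence is strictly sequential in s within one sequence, while different sequences are completely independent. That asymmetry is exactly what makes the choice of parallelization strategy interesting.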

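Second sketch, as one candidate for the strategy list in timeline item 1 (not a decision, just something concrete to argue about): when the workload is many independent sequences, which is the common case for POS tagging or speech corpora, a map-only job that decodes one sequence per input record already scales, with no reducer at all. Assumptions of mine, not anything settled in the proposal: input lines hold space-separated observation-symbol indices, a toy model is hardcoded in setup() where a real job would ship the trained HmmModel to the mappers (the shipping mechanism itself is an open design choice), and ViterbiMapper is a hypothetical name reusing ViterbiSketch from above.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of a map-only decoding job: one independent observation sequence
 * per input line, one Viterbi path per output record. No reducer is needed
 * because sequences never interact.
 */
public class ViterbiMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private double[] logPi;   // log initial probabilities
  private double[][] logA;  // log transition matrix
  private double[][] logB;  // log emission matrix

  @Override
  protected void setup(Context context) {
    // Toy 2-state, 2-symbol model in log space, hardcoded for the sketch.
    // A real job would instead distribute the trained HmmModel to mappers,
    // e.g. via a side file; that mechanism is an open design choice.
    logPi = new double[] {Math.log(0.6), Math.log(0.4)};
    logA = new double[][] {
        {Math.log(0.7), Math.log(0.3)},
        {Math.log(0.4), Math.log(0.6)}};
    logB = new double[][] {
        {Math.log(0.9), Math.log(0.1)},
        {Math.log(0.2), Math.log(0.8)}};
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String text = line.toString().trim();
    if (text.isEmpty()) {
      return;  // skip blank lines
    }
    // Parse one observation sequence: space-separated symbol indices.
    String[] tokens = text.split("\\s+");
    int[] obs = new int[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      obs[i] = Integer.parseInt(tokens[i]);
    }
    // Run the plain sequential recurrence on this one sequence.
    int[] path = ViterbiSketch.viterbi(logPi, logA, logB, obs);
    // Emit the decoded state sequence keyed by the input offset.
    StringBuilder out = new StringBuilder();
    for (int state : path) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(state);
    }
    context.write(offset, new Text(out.toString()));
  }
}

The obvious limitation is a single very long sequence, which this split cannot help with; that case would need time-wise partitioning of one sequence plus some scheme for stitching the pieces at the boundaries, which is exactly the kind of trade-off worth settling during stage 1.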