Do you already have some strategies in mind, re: "1. 2 weeks. Prepare a
list of all reasonable parallelization strategies"?

Thanks.
-D

On Wed, Apr 6, 2011 at 1:44 PM, Sergey Bartunov (JIRA) <[email protected]> wrote:
> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Sergey Bartunov
>
>
> Proposal Title: Parallel Viterbi algorithm for HMM
>
> Student Name: Sergey Bartunov
>
> Student E-mail: [email protected]
>
>
>
> Organization/Project: Apache Mahout
>
> Assigned Mentor:
>
>
> Proposal Abstract:
>
> The Viterbi algorithm is an evaluation algorithm for Hidden Markov Models [1].
> It estimates the most likely sequence of hidden states, called the "Viterbi
> path", from a given sequence of observed states. The Viterbi algorithm is also
> used in Viterbi training, which is less numerically stable than the Baum-Welch
> algorithm but faster, and it can be adjusted to reach nearly the same
> accuracy [2]. Hidden Markov Models are widely used for speech and gesture
> recognition, part-of-speech tagging, bioinformatics, etc. At the moment Apache
> Mahout contains only sequential HMM functionality, and this project is
> intended to extend it by implementing a Map-Reduce version of the Viterbi
> algorithm, which would make Mahout able to evaluate HMMs on large amounts of
> data in parallel.
>
> Detailed Description:
>
> The Viterbi algorithm is a fairly common dynamic programming algorithm and is
> well described in many sources, such as [3], [4], [5]. Being popular and
> widely needed, the Viterbi algorithm already has parallel versions that have
> been the subject of several studies, for example a paper on a GPU
> implementation [6]. Some of these papers may contain useful ideas for a
> map-reduce implementation.
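>
> For reference, the core dynamic programming recurrence is
> delta_t(j) = max_i [ delta_{t-1}(i) * a_ij ] * b_j(o_t), usually computed
> in log-space for numerical stability. A minimal sequential sketch in Java,
> just to fix notation (the class and method names are my own illustration,
> not the final Mahout code):
>
>     /** Minimal sequential Viterbi sketch (illustration only). */
>     public final class ViterbiSketch {
>       /**
>        * @param initial  initial state probabilities, initial[i]
>        * @param trans    transition probabilities, trans[i][j] = P(j | i)
>        * @param emission emission probabilities, emission[i][o] = P(o | i)
>        * @param observed sequence of observed symbol indices
>        * @return the most likely sequence of hidden states (the Viterbi path)
>        */
>       static int[] decode(double[] initial, double[][] trans,
>                           double[][] emission, int[] observed) {
>         int numStates = initial.length;
>         int len = observed.length;
>         // delta[t][j]: log-probability of the best path ending in state j at time t
>         double[][] delta = new double[len][numStates];
>         // psi[t][j]: predecessor of state j on that best path, for backtracking
>         int[][] psi = new int[len][numStates];
>         for (int j = 0; j < numStates; j++) {
>           delta[0][j] = Math.log(initial[j]) + Math.log(emission[j][observed[0]]);
>         }
>         for (int t = 1; t < len; t++) {
>           for (int j = 0; j < numStates; j++) {
>             int best = 0;
>             double bestScore = Double.NEGATIVE_INFINITY;
>             for (int i = 0; i < numStates; i++) {
>               double score = delta[t - 1][i] + Math.log(trans[i][j]);
>               if (score > bestScore) { bestScore = score; best = i; }
>             }
>             delta[t][j] = bestScore + Math.log(emission[j][observed[t]]);
>             psi[t][j] = best;
>           }
>         }
>         // Pick the best final state, then walk the psi pointers backwards.
>         int last = 0;
>         for (int j = 1; j < numStates; j++) {
>           if (delta[len - 1][j] > delta[len - 1][last]) { last = j; }
>         }
>         int[] path = new int[len];
>         path[len - 1] = last;
>         for (int t = len - 1; t > 0; t--) {
>           path[t - 1] = psi[t][path[t]];
>         }
>         return path;
>       }
>     }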
>
> There are several possible strategies for parallelization (which one is
> better is an open question for the project), and I will discuss them on the
> mailing list.
>
> [Probably at least one strategy will be placed here later]
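>
> As a placeholder illustration only (this is not necessarily the strategy
> that will be chosen), the simplest form of parallelism is over independent
> observation sequences: each map task loads the shared model parameters and
> decodes whole sequences on its own, so no communication between tasks is
> needed. A hedged Hadoop sketch, reusing the hypothetical
> ViterbiSketch.decode(...) helper from above; parallelizing within a single
> very long sequence is the harder open question:
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     /** Candidate strategy: each map task decodes whole, independent sequences. */
>     public class ViterbiMapper extends Mapper<LongWritable, Text, Text, Text> {
>
>       // Shared model parameters; a real Job would load these in setup(),
>       // e.g. from the DistributedCache, rather than leave them unset.
>       private double[] initial;
>       private double[][] trans;
>       private double[][] emission;
>
>       @Override
>       protected void map(LongWritable offset, Text line, Context context)
>           throws IOException, InterruptedException {
>         // Assumed input format: one sequence per line, "id<TAB>o1 o2 o3 ...".
>         String[] parts = line.toString().split("\t", 2);
>         String[] tokens = parts[1].trim().split("\\s+");
>         int[] observed = new int[tokens.length];
>         for (int i = 0; i < tokens.length; i++) {
>           observed[i] = Integer.parseInt(tokens[i]);
>         }
>         // Decode this sequence independently of all others.
>         int[] path = ViterbiSketch.decode(initial, trans, emission, observed);
>         StringBuilder out = new StringBuilder();
>         for (int state : path) {
>           out.append(state).append(' ');
>         }
>         context.write(new Text(parts[0]), new Text(out.toString().trim()));
>       }
>     }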
>
> Timeline:
> 1. 2 weeks. Prepare a list of all reasonable parallelization strategies,
> probably drawing some ideas from the papers, then analyze them and discuss
> them with the mentor/community. Learn Mahout internals and best practices.
> This stage may be shortened by parallel discussions on the mailing list
> before GSoC actually begins.
> 2. 4 weeks. Select the most suitable strategy and implement it in the Mahout
> code base as a Job. The implementation should work (or be compatible) with
> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel (a sketch of the
> sequential API it must stay compatible with follows this list).
> 3. 2 weeks. Collect a debugging dataset, write tests for the code, find
> possible problems, and fix them.
> 4. 2 weeks. Try to obtain a large dataset to test performance, scalability,
> etc., and write and run such a test (community help might be needed here).
> If any problems appear, discuss and analyze them, then fix them.
> 5. 1 week. Write documentation for all classes, a description of the
> algorithm in the wiki, and a complete usage example.
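>
> To make the compatibility goal in step 2 concrete, here is how the existing
> sequential API is used today, assuming the current entry points (HmmModel's
> matrix/vector constructor and HmmEvaluator.decode); the parallel Job should
> consume the same HmmModel:
>
>     import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
>     import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;
>     import org.apache.mahout.math.DenseMatrix;
>     import org.apache.mahout.math.DenseVector;
>
>     /** Sketch of the sequential API the parallel Job should stay compatible with. */
>     public final class SequentialDecodeExample {
>       public static void main(String[] args) {
>         // Toy 2-state, 2-symbol model; the numbers are made up for illustration.
>         HmmModel model = new HmmModel(
>             new DenseMatrix(new double[][] {{0.7, 0.3}, {0.4, 0.6}}),  // transitions
>             new DenseMatrix(new double[][] {{0.9, 0.1}, {0.2, 0.8}}),  // emissions
>             new DenseVector(new double[] {0.6, 0.4}));                 // initial
>
>         int[] observed = {0, 0, 1, 1, 0};
>         // Sequential Viterbi decoding; 'true' requests the log-scaled variant.
>         int[] viterbiPath = HmmEvaluator.decode(model, observed, true);
>         for (int state : viterbiPath) {
>           System.out.print(state + " ");
>         }
>       }
>     }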
>
> Additional Information:
>
> The project is related to [this
> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about
> implementing a parallel Baum-Welch training algorithm for HMMs.
>
> One important detail: I have university exams until June 16th. After that I
> will be ready to work on the project.
>
> References:
>
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models 
> and selected applications in speech 
> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. 
> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> [2] [Adjusted Viterbi Training. A proof of 
> concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>
