[GSoC Proposal] Parallel Viterbi algorithm for HMM
--------------------------------------------------

                 Key: MAHOUT-652
                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
             Project: Mahout
          Issue Type: New Feature
          Components: Classification
    Affects Versions: 0.4
            Reporter: Sergey Bartunov


Proposal Title: Parallel Viterbi algorithm for HMM

Student Name: Sergey Bartunov

Student E-mail: [email protected]

 

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract:

The Viterbi algorithm is a decoding algorithm for Hidden Markov Models [1].
It finds the most likely sequence of hidden states, called the "Viterbi
path", for a given sequence of observed states. The Viterbi algorithm is
also used in Viterbi training, which is less numerically stable than the
Baum-Welch algorithm but faster, and which can be adjusted to reach nearly
the same accuracy [2]. Hidden Markov Models are widely used for speech and
gesture recognition, part-of-speech tagging, bioinformatics, etc. At the
moment Apache Mahout contains only sequential HMM functionality, and this
project is intended to extend it by implementing a MapReduce version of the
Viterbi algorithm, which would enable Mahout to decode HMMs over large
amounts of data in parallel.

Detailed Description:

The Viterbi algorithm is a fairly common dynamic programming algorithm and
is well described in many sources such as [3], [4], [5]. Being popular and
widely needed, it already has parallel versions that have been the subject
of several studies, for example a paper on a GPU implementation [6]. Some of
these papers may contain useful ideas for a MapReduce implementation.
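
For reference, the heart of the sequential algorithm is a single
dynamic-programming recurrence over the observation sequence, usually
computed in log-space for numerical stability. Below is a minimal sketch of
that recurrence; the class and variable names are illustrative only, not
existing Mahout code:

{code:java}
// Minimal log-space sketch of the sequential Viterbi recurrence.
// pi = initial state distribution, a = transition matrix,
// b = emission matrix, obs = encoded observation sequence.
public final class ViterbiSketch {

  public static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
    int n = pi.length;                    // number of hidden states
    int t = obs.length;                   // length of the observation sequence
    double[][] delta = new double[t][n];  // best log-probability ending in each state
    int[][] psi = new int[t][n];          // back-pointers for path recovery

    // Initialization from the initial distribution and the first observation.
    for (int i = 0; i < n; i++) {
      delta[0][i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
    }
    // Recurrence: extend the best path into each state j at each step.
    for (int step = 1; step < t; step++) {
      for (int j = 0; j < n; j++) {
        double best = Double.NEGATIVE_INFINITY;
        int arg = 0;
        for (int i = 0; i < n; i++) {
          double v = delta[step - 1][i] + Math.log(a[i][j]);
          if (v > best) { best = v; arg = i; }
        }
        delta[step][j] = best + Math.log(b[j][obs[step]]);
        psi[step][j] = arg;
      }
    }
    // Backtrack from the most likely final state to recover the Viterbi path.
    int last = 0;
    for (int i = 1; i < n; i++) {
      if (delta[t - 1][i] > delta[t - 1][last]) { last = i; }
    }
    int[] path = new int[t];
    path[t - 1] = last;
    for (int step = t - 1; step > 0; step--) {
      path[step - 1] = psi[step][path[step]];
    }
    return path;
  }
}
{code}

The recurrence at each step depends only on the previous step, and this
dependency is exactly what any parallelization strategy has to deal with.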

There are several possible strategies for parallelization (which one is
best is an open question for the project); I will discuss them on the
mailing list.

[Probably at least one strategy will be placed here later]
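
Purely as an illustration (not a committed design), the most straightforward
strategy is to parallelize across independent observation sequences: the
model is small and can be shipped to every node, and each map() call decodes
one sequence with the sequential algorithm. The mapper below is a
hypothetical sketch under these assumptions; the class name, configuration
keys and text encoding are invented for illustration, and a real
implementation would load an
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel instead of raw
arrays:

{code:java}
// Hypothetical mapper for the "independent sequences" strategy.
// Input: key = sequence id, value = space-separated observation ids.
// Output: key = sequence id, value = space-separated Viterbi path.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ViterbiMapper extends Mapper<Text, Text, Text, Text> {

  private double[] pi;   // initial state distribution
  private double[][] a;  // transition probabilities
  private double[][] b;  // emission probabilities

  @Override
  protected void setup(Context context) {
    // The HMM is small, so this sketch ships it in the job configuration
    // under invented keys; a real job would use the DistributedCache.
    Configuration conf = context.getConfiguration();
    pi = parseVector(conf.get("hmm.initial"));
    a = parseMatrix(conf.get("hmm.transitions"));
    b = parseMatrix(conf.get("hmm.emissions"));
  }

  @Override
  protected void map(Text seqId, Text observations, Context context)
      throws IOException, InterruptedException {
    // Sequences are independent, so each one is decoded sequentially here,
    // reusing the decode() sketch above.
    int[] obs = toInts(observations.toString().trim().split("\\s+"));
    int[] path = ViterbiSketch.decode(pi, a, b, obs);
    StringBuilder out = new StringBuilder();
    for (int state : path) {
      out.append(state).append(' ');
    }
    context.write(seqId, new Text(out.toString().trim()));
  }

  private static int[] toInts(String[] tokens) {
    int[] v = new int[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      v[i] = Integer.parseInt(tokens[i]);
    }
    return v;
  }

  private static double[] parseVector(String s) {
    String[] tokens = s.trim().split("\\s+");
    double[] v = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      v[i] = Double.parseDouble(tokens[i]);
    }
    return v;
  }

  private static double[][] parseMatrix(String s) {
    String[] rows = s.trim().split(";");
    double[][] m = new double[rows.length][];
    for (int i = 0; i < rows.length; i++) {
      m[i] = parseVector(rows[i]);
    }
    return m;
  }
}
{code}

Note that this strategy only helps when there are many sequences; decoding a
single very long sequence in parallel requires splitting the sequence itself
and reconciling states at the chunk boundaries, which is one of the open
questions mentioned above.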

Timeline:
1. 2 weeks. Prepare a list of all reasonable parallelization strategies,
probably drawing some ideas from the papers, analyze them and discuss them
with the mentor/community. Explore Mahout internals and best practices. This
stage will probably be shortened by parallel discussions on the mailing list
before GSoC actually begins.
2. 4 weeks. Select the most suitable strategy and implement it in Mahout as
a Job. The implementation should work (or be compatible) with
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
3. 2 weeks. Collect a debugging dataset, write tests for the code, find
possible problems and fix them.
4. 2 weeks. Try to get a large dataset for testing performance, scalability,
etc., then write and run such a test (community help might be needed here).
If there are any problems, discuss and analyze them, then fix them.
5. 1 week. Write documentation for all classes, a description of the
algorithm in the wiki and a complete usage example.

Additional Information:

The project is related to [this
proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about
implementing a parallel Baum-Welch training algorithm for HMM.

One important detail is that I have university exams until June 16th. After
that I will be ready to work on the project.

References:

[1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models 
and selected applications in speech 
recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. 
Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626. 
[2] [Adjusted Viterbi Training. A proof of 
concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
[3] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5] http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
