[GSoC Proposal] Parallel Viterbi algorithm for HMM
--------------------------------------------------
Key: MAHOUT-652
URL: https://issues.apache.org/jira/browse/MAHOUT-652
Project: Mahout
Issue Type: New Feature
Components: Classification
Affects Versions: 0.4
Reporter: Sergey Bartunov
Proposal Title: Parallel Viterbi algorithm for HMM
Student Name: Sergey Bartunov
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:
Proposal Abstract:
The Viterbi algorithm is a decoding algorithm for Hidden Markov Models (HMMs) [1].
It estimates the most likely sequence of hidden states, called the "Viterbi path",
from a given sequence of observations. The Viterbi algorithm is also used in
Viterbi training, which is less numerically stable than the Baum-Welch algorithm
but faster, and can be adjusted to reach nearly the same accuracy [2]. Hidden
Markov Models are widely used for speech and gesture recognition, part-of-speech
tagging, bioinformatics, etc. At the moment Apache Mahout contains only
sequential HMM functionality, and this project is intended to extend it by
implementing a MapReduce version of the Viterbi algorithm, which would make
Mahout able to decode HMMs over large amounts of data in parallel.
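For reference, the dynamic program that produces the Viterbi path can be written,
in the notation of Rabiner [1] (transition probabilities a_{ij}, emission
probabilities b_j(o_t), initial probabilities \pi_i), roughly as

\delta_1(i) = \pi_i \, b_i(o_1), \qquad
\delta_t(j) = \max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(o_t), \qquad
\psi_t(j) = \arg\max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}\bigr],

with the path itself recovered by backtracking through \psi from the most likely
final state.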
Detailed Description:
The Viterbi algorithm is a well-known dynamic programming algorithm; it is
described in many sources such as [3], [4], [5]. Being popular and widely
needed, the Viterbi algorithm already has parallel versions that have been the
subject of several studies, for example, a paper on a GPU implementation [6].
Some of these papers may contain useful ideas for a MapReduce implementation.
There are several possible parallelization strategies (which one is better is
an open question for the project), and I will discuss them on the mailing
list. A purely sequential sketch of the algorithm is given below to make the
discussion concrete.
[Probably at least one strategy will be placed here later]
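The following is a minimal sequential Viterbi sketch on plain Java arrays,
working in log space for numerical stability. It only illustrates the dynamic
program that the MapReduce job would have to compute; it is not tied to any
existing Mahout class.

{code:java}
// Minimal sequential Viterbi sketch on plain arrays (log-space).
// Illustration only; not part of the proposed Mahout code.
public class ViterbiSketch {

  /** Returns the most likely hidden state sequence for the given observations. */
  public static int[] viterbi(double[] initial,      // pi[i], size N
                              double[][] transition, // a[i][j], N x N
                              double[][] emission,   // b[i][k], N x M
                              int[] observations) {  // o[t] in [0, M)
    int numStates = initial.length;
    int length = observations.length;
    double[][] delta = new double[length][numStates]; // best log-probability so far
    int[][] psi = new int[length][numStates];         // backpointers

    for (int i = 0; i < numStates; i++) {
      delta[0][i] = Math.log(initial[i]) + Math.log(emission[i][observations[0]]);
    }
    for (int t = 1; t < length; t++) {
      for (int j = 0; j < numStates; j++) {
        double best = Double.NEGATIVE_INFINITY;
        int bestPrev = 0;
        for (int i = 0; i < numStates; i++) {
          double score = delta[t - 1][i] + Math.log(transition[i][j]);
          if (score > best) {
            best = score;
            bestPrev = i;
          }
        }
        delta[t][j] = best + Math.log(emission[j][observations[t]]);
        psi[t][j] = bestPrev;
      }
    }
    // Backtrack from the most likely final state.
    int[] path = new int[length];
    int last = 0;
    for (int j = 1; j < numStates; j++) {
      if (delta[length - 1][j] > delta[length - 1][last]) {
        last = j;
      }
    }
    path[length - 1] = last;
    for (int t = length - 1; t > 0; t--) {
      path[t - 1] = psi[t][path[t]];
    }
    return path;
  }
}
{code}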
Timeline:
1. 2 weeks. Prepare a list of all reasonable parallelization strategies,
possibly drawing ideas from the papers, analyze them and discuss them with the
mentor/community. Learn Mahout internals and best practices. This stage may be
shortened by discussions on the mailing list before GSoC actually begins.
2. 4 weeks. Select the most suitable strategy and implement it in Mahout as a
Hadoop Job (a rough mapper sketch for one possible strategy is given after this
timeline). The implementation should work with (or be compatible with)
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
3. 2 weeks. Collect a debugging dataset, write tests for the code, find
possible problems and fix them.
4. 2 weeks. Obtain a large dataset to test performance, scalability, etc., and
write and run such a test (community help might be needed here). If any
problems arise, discuss and analyze them, then fix them.
5. 1 week. Write documentation for all classes, a description of the
algorithm in the wiki, and a complete usage example.
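To make step 2 more tangible, here is a rough, hypothetical mapper for one
possible strategy in which each map task decodes one observation sequence
independently, reusing the viterbi(...) sketch above. The actual strategy,
input format and model loading are to be decided in step 1, so everything
below (class names, key/value types, line-based input) is an assumption for
illustration only.

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: one observation sequence per input line
// (space-separated observation indices); output is the decoded Viterbi path.
public class ViterbiMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  // Model parameters; in a real job they would be loaded in setup(),
  // e.g. from a serialized HmmModel shipped via the distributed cache.
  private double[] initial;
  private double[][] transition;
  private double[][] emission;

  @Override
  protected void setup(Context context) {
    // Model loading omitted in this sketch.
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      return; // skip empty lines
    }
    String[] tokens = line.split("\\s+");
    int[] observations = new int[tokens.length];
    for (int t = 0; t < tokens.length; t++) {
      observations[t] = Integer.parseInt(tokens[t]);
    }
    int[] path = ViterbiSketch.viterbi(initial, transition, emission, observations);
    StringBuilder out = new StringBuilder();
    for (int state : path) {
      out.append(state).append(' ');
    }
    context.write(key, new Text(out.toString().trim()));
  }
}
{code}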
Additional Information:
The project is related to [this
proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about
implementing a parallel Baum-Welch training algorithm for HMM.
An important detail is that I have university exams until the 16th of June.
After that I will be ready to work on the project.
References:
[1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models
and selected applications in speech
recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
[2] [Adjusted Viterbi Training. A proof of
concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5] http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903