Check the mailing list. Just sent some ideas there.
On 7 April 2011 01:10, Dmitriy Lyubimov <[email protected]> wrote:
> Do you already have some, RE: "1. 2 weeks. Prepare a list of all
> reasonable parallelization strategies"?
>
> Thanks.
> -D
>
> On Wed, Apr 6, 2011 at 1:44 PM, Sergey Bartunov (JIRA) <[email protected]>
> wrote:
>> [GSoC Proposal] Parallel Viterbi algorithm for HMM
>> --------------------------------------------------
>>
>>                  Key: MAHOUT-652
>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-652
>>              Project: Mahout
>>           Issue Type: New Feature
>>           Components: Classification
>>     Affects Versions: 0.4
>>             Reporter: Sergey Bartunov
>>
>>
>> Proposal Title: Parallel Viterbi algorithm for HMM
>>
>> Student Name: Sergey Bartunov
>>
>> Student E-mail: [email protected]
>>
>> Organization/Project: Apache Mahout
>>
>> Assigned Mentor:
>>
>> Proposal Abstract:
>>
>> The Viterbi algorithm is a decoding algorithm for Hidden Markov Models
>> [1]. It estimates the most likely sequence of hidden states, called the
>> "Viterbi path", from a given sequence of observed symbols. The Viterbi
>> algorithm is also used in Viterbi training, which is less numerically
>> stable than the Baum-Welch algorithm but faster, and which can be
>> adjusted to reach nearly the same accuracy [2]. Hidden Markov Models
>> are widely used for speech and gesture recognition, part-of-speech
>> tagging, bioinformatics, etc. At the moment Apache Mahout contains only
>> sequential HMM functionality, and this project is intended to extend it
>> by implementing a map-reduce version of the Viterbi algorithm, which
>> would make Mahout able to evaluate HMMs on large amounts of data in
>> parallel.
>>
>> Detailed Description:
>>
>> The Viterbi algorithm is a fairly standard dynamic-programming
>> algorithm and is well described in many sources, such as [3], [4] and
>> [5]. Being popular and widely needed, it already has parallel versions,
>> which have been the subject of several studies, for example this paper
>> on a GPU implementation [6]. Some of these papers may contain useful
>> ideas for a map-reduce implementation.
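>>
>> To make the computation concrete, below is a minimal sketch of the
>> sequential recurrence delta_t(j) = max_i [delta_{t-1}(i) * a_ij] *
>> b_j(o_t) that the project would parallelize. The plain-array model
>> representation and log-space arithmetic are my simplifying assumptions
>> for illustration only, not the actual Mahout HmmModel API:
>>
>> // Sketch only: sequential Viterbi over a simplified HMM representation
>> // (log-space probabilities to avoid numerical underflow).
>> public final class ViterbiSketch {
>>   static int[] viterbi(double[] logPi, double[][] logA,
>>                        double[][] logB, int[] obs) {
>>     int n = logPi.length;                // number of hidden states
>>     int t = obs.length;                  // length of observed sequence
>>     double[][] delta = new double[t][n]; // delta[k][j]: best log-score
>>                                          // of a path ending in state j
>>                                          // at time k
>>     int[][] psi = new int[t][n];         // psi[k][j]: best predecessor
>>
>>     // Initialization: delta_0(j) = log(pi_j) + log(b_j(o_0))
>>     for (int j = 0; j < n; j++) {
>>       delta[0][j] = logPi[j] + logB[j][obs[0]];
>>     }
>>     // Recursion: each time step depends on the previous one.
>>     for (int k = 1; k < t; k++) {
>>       for (int j = 0; j < n; j++) {
>>         int best = 0;
>>         double bestScore = delta[k - 1][0] + logA[0][j];
>>         for (int i = 1; i < n; i++) {
>>           double score = delta[k - 1][i] + logA[i][j];
>>           if (score > bestScore) { bestScore = score; best = i; }
>>         }
>>         delta[k][j] = bestScore + logB[j][obs[k]];
>>         psi[k][j] = best;
>>       }
>>     }
>>     // Termination: pick the best final state, then backtrack the path.
>>     int[] path = new int[t];
>>     for (int j = 1; j < n; j++) {
>>       if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
>>     }
>>     for (int k = t - 1; k > 0; k--) {
>>       path[k - 1] = psi[k][path[k]];
>>     }
>>     return path;
>>   }
>> }
>>
>> The recursion costs O(T * N^2) time. Many independent sequences can be
>> decoded in parallel trivially; the dependency of delta_t on delta_{t-1}
>> is what makes parallelizing over a single long sequence the interesting
>> question.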
>>
>> There are several possible strategies for parallelization (which one is
>> best is an open question for the project); I'll discuss them in the
>> mailing list.
>>
>> [Probably at least one strategy will be placed here later]
>>
>> Timeline:
>> 1. 2 weeks. Prepare a list of all reasonable parallelization
>> strategies, drawing ideas from the papers where possible; analyze them
>> and discuss them with the mentor/community. Study Mahout internals and
>> best practices. This stage may be shortened by discussions in the
>> mailing list before GSoC actually begins.
>> 2. 4 weeks. Select the most suitable strategy and implement it in the
>> Mahout codebase as a Job. The implementation should work (or be
>> compatible) with
>> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
>> 3. 2 weeks. Collect a debugging dataset, write tests for the code, find
>> possible problems, and fix them.
>> 4. 2 weeks. Obtain a large dataset to test performance, scalability,
>> etc., and write and run such a test (community help might be needed
>> here). If any problems come up, discuss and analyze them, then fix
>> them.
>> 5. 1 week. Write documentation for all classes, a description of the
>> algorithm in the wiki, and a complete usage example.
>>
>> Additional Information:
>>
>> The project relates to [this
>> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is
>> about implementing a parallel Baum-Welch training algorithm for HMM.
>>
>> One important detail: I have exams at my university until June 16th;
>> after that I will be ready to work on the project.
>>
>> References:
>>
>> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov
>> Models and selected applications in speech
>> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
>> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
>> [2] [Adjusted Viterbi Training: a proof of
>> concept|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf].
>> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
>> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
>> [5] http://www.kanungo.com/software/hmmtut.pdf
>> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
