Check the mailing list. Just sent some ideas there.
On 7 April 2011 01:10, Dmitriy Lyubimov <[email protected]> wrote:
> Do you already have some, RE: "1. 2 weeks. Prepare a list of all
> reasonable parallelization strategies"?
>
> Thanks.
> -D
>
> On Wed, Apr 6, 2011 at 1:44 PM, Sergey Bartunov (JIRA) <[email protected]>
> wrote:
>> [GSoC Proposal] Parallel Viterbi algorithm for HMM
>> --------------------------------------------------
>>
>>                  Key: MAHOUT-652
>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-652
>>              Project: Mahout
>>           Issue Type: New Feature
>>           Components: Classification
>>     Affects Versions: 0.4
>>             Reporter: Sergey Bartunov
>>
>>
>> Proposal Title: Parallel Viterbi algorithm for HMM
>>
>> Student Name: Sergey Bartunov
>>
>> Student E-mail: [email protected]
>>
>> Organization/Project: Apache Mahout
>>
>> Assigned Mentor:
>>
>> Proposal Abstract:
>>
>> The Viterbi algorithm is a decoding algorithm for Hidden Markov Models
>> [1]. It estimates the most likely sequence of hidden states, called the
>> "Viterbi path", from a given sequence of observed symbols. The Viterbi
>> algorithm is also used in Viterbi training, which is less numerically
>> stable than the Baum-Welch algorithm but faster, and which can be
>> adjusted to reach nearly the same accuracy [2]. Hidden Markov Models
>> are widely used for speech and gesture recognition, part-of-speech
>> tagging, bioinformatics, etc. At the moment Apache Mahout contains only
>> sequential HMM functionality, and this project is intended to extend it
>> by implementing a map-reduce version of the Viterbi algorithm, which
>> would make Mahout able to evaluate HMMs on large amounts of data in
>> parallel.
>>
>> Detailed Description:
>>
>> The Viterbi algorithm is a fairly standard dynamic-programming
>> algorithm and is well described in many sources, such as [3], [4] and
>> [5]. Being popular and widely needed, it already has parallel versions,
>> which have been the subject of several studies, for example this paper
>> on a GPU implementation [6]. Some of these papers may contain useful
>> ideas for a map-reduce implementation.
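>>
>> To make the computation concrete, below is a minimal sketch of the
>> sequential recurrence delta_t(j) = max_i [delta_{t-1}(i) * a_ij] *
>> b_j(o_t) that the project would parallelize. The plain-array model
>> representation and log-space arithmetic are my simplifying assumptions
>> for illustration only, not the actual Mahout HmmModel API:
>>
>> // Sketch only: sequential Viterbi over a simplified HMM representation
>> // (log-space probabilities to avoid numerical underflow).
>> public final class ViterbiSketch {
>>   static int[] viterbi(double[] logPi, double[][] logA,
>>                        double[][] logB, int[] obs) {
>>     int n = logPi.length;                // number of hidden states
>>     int t = obs.length;                  // length of observed sequence
>>     double[][] delta = new double[t][n]; // delta[k][j]: best log-score
>>                                          // of a path ending in state j
>>                                          // at time k
>>     int[][] psi = new int[t][n];         // psi[k][j]: best predecessor
>>
>>     // Initialization: delta_0(j) = log(pi_j) + log(b_j(o_0))
>>     for (int j = 0; j < n; j++) {
>>       delta[0][j] = logPi[j] + logB[j][obs[0]];
>>     }
>>     // Recursion: each time step depends on the previous one.
>>     for (int k = 1; k < t; k++) {
>>       for (int j = 0; j < n; j++) {
>>         int best = 0;
>>         double bestScore = delta[k - 1][0] + logA[0][j];
>>         for (int i = 1; i < n; i++) {
>>           double score = delta[k - 1][i] + logA[i][j];
>>           if (score > bestScore) { bestScore = score; best = i; }
>>         }
>>         delta[k][j] = bestScore + logB[j][obs[k]];
>>         psi[k][j] = best;
>>       }
>>     }
>>     // Termination: pick the best final state, then backtrack the path.
>>     int[] path = new int[t];
>>     for (int j = 1; j < n; j++) {
>>       if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
>>     }
>>     for (int k = t - 1; k > 0; k--) {
>>       path[k - 1] = psi[k][path[k]];
>>     }
>>     return path;
>>   }
>> }
>>
>> The recursion costs O(T * N^2) time. Many independent sequences can be
>> decoded in parallel trivially; the dependency of delta_t on delta_{t-1}
>> is what makes parallelizing over a single long sequence the interesting
>> question.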
>>
>> There are several possible strategies for parallelization (which one is
>> best is an open question for the project); I'll discuss them in the
>> mailing list.
>>
>> [Probably at least one strategy will be placed here later]
>>
>> Timeline:
>> 1. 2 weeks. Prepare a list of all reasonable parallelization
>> strategies, drawing ideas from the papers where possible; analyze them
>> and discuss them with the mentor/community. Study Mahout internals and
>> best practices. This stage may be shortened by discussions in the
>> mailing list before GSoC actually begins.
>> 2. 4 weeks. Select the most suitable strategy and implement it in the
>> Mahout codebase as a Job. The implementation should work (or be
>> compatible) with
>> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
>> 3. 2 weeks. Collect a debugging dataset, write tests for the code, find
>> possible problems, and fix them.
>> 4. 2 weeks. Obtain a large dataset to test performance, scalability,
>> etc., and write and run such a test (community help might be needed
>> here). If any problems come up, discuss and analyze them, then fix
>> them.
>> 5. 1 week. Write documentation for all classes, a description of the
>> algorithm in the wiki, and a complete usage example.
>>
>> Additional Information:
>>
>> The project relates to [this
>> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is
>> about implementing a parallel Baum-Welch training algorithm for HMM.
>>
>> One important detail: I have exams at my university until June 16th;
>> after that I will be ready to work on the project.
>>
>> References:
>>
>> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov
>> Models and selected applications in speech
>> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
>> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
>> [2] [Adjusted Viterbi Training: a proof of
>> concept|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf].
>> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
>> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
>> [5] http://www.kanungo.com/software/hmmtut.pdf
>> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
