[
https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054870#comment-13054870
]
Grant Ingersoll commented on MAHOUT-652:
----------------------------------------
Awesome! How goes the testing?
> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
> Key: MAHOUT-652
> URL: https://issues.apache.org/jira/browse/MAHOUT-652
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.5
> Reporter: Sergey Bartunov
> Assignee: Isabel Drost
> Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11,
> sequence-learning
> Fix For: 0.6
>
>
> Proposal Title: Parallel Viterbi algorithm for HMM
> Student Name: Sergey Bartunov
> Student E-mail: [email protected]
>
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> The Viterbi algorithm is a decoding algorithm for Hidden Markov Models [1].
> It estimates the most likely sequence of hidden states, called the "Viterbi
> path", from a given sequence of observed states. The Viterbi algorithm is
> also used in Viterbi training, which is less numerically stable than the
> Baum-Welch algorithm but faster, and can be adjusted to reach nearly the
> same accuracy [2]. Hidden Markov Models are widely used for speech and
> gesture recognition, part-of-speech tagging, bioinformatics, etc. At the
> moment Apache Mahout contains only sequential HMM functionality, and this
> project is intended to extend it by implementing a Map-Reduce version of the
> Viterbi algorithm, which would let Mahout evaluate HMMs on large amounts of
> data in parallel.
> Detailed Description:
> The Viterbi algorithm is a fairly common dynamic programming algorithm; it
> is well described in many sources, such as [3], [4], [5]. Being popular and
> widely needed, the Viterbi algorithm already has parallel versions, which
> have been the subject of several studies, for example this paper about a
> GPU implementation [6]. Some of these papers may contain useful ideas for a
> map-reduce implementation.
> I have explored several possible strategies for parallelization (which one
> is better is an open question for the project) and posted them to the
> mailing list.
> See [this
> thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E]
> The most reasonable one is:
> Assume we have O - the set of observed state sequences (that's our input
> data), where $O_i$ is the i-th sequence with length len($O_i$), K is the
> number of hidden states in our model, and $O_{i, t}$ is the observed state
> of the i-th sequence at the moment of time t. Let N be the maximum of
> len($O_i$).
> Let's write all the sequences one above another. We get a matrix whose rows
> are the input sequences. Then take the first L columns of this matrix and
> call them the "first chunk", then the next L columns, and so on. Thus we
> divide our input into chunks of width L (the last chunk may be narrower).
> L is a parameter of the algorithm. So the n-th chunk contains the
> subsequence elements with $t$ in the interval [L*n, L*(n+1)).
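> The chunk division described above can be sketched as follows (an
> illustrative sketch only, not part of the Mahout API; the class and method
> names are made up):

```java
// Hypothetical sketch of the chunk-division step: each sequence (a matrix
// row) is cut into column chunks of width L; chunk n covers time steps
// t in [L*n, L*(n+1)). Not Mahout code - names are illustrative.
public class ChunkSplitter {

    /**
     * Returns the part of sequence seq falling into chunk number n,
     * i.e. the observed states at time steps t in [L*n, L*(n+1)),
     * or an empty array if the sequence has already ended.
     */
    public static int[] chunk(int[] seq, int L, int n) {
        int from = L * n;
        int to = Math.min(L * (n + 1), seq.length);
        if (from >= seq.length) {
            return new int[0]; // this sequence is shorter than L*n
        }
        return java.util.Arrays.copyOfRange(seq, from, to);
    }
}
```

> Note that the last chunk of a sequence may be narrower than L, which is
> exactly the "good case" / "difficult case" distinction discussed below.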
> The key idea is to process the chunks one after another, handling each
> chunk with the standard sequential Viterbi algorithm (the existing Mahout
> code could be reused here) applied to all of its subsequences in parallel,
> using the results of the previous chunk's processing.
> This will require about O((M/P) * K^2) time, where M is the sum of all
> sequence lengths and P is the number of "cores" / computational nodes,
> provided most input sequences have nearly the same length (the good case).
> This time complexity means that such a strategy scales well.
> Here is Map-Reduce scheme for the good case:
> * Map emits a (subsequence, probabilities, paths) vector for each
> subsequence in the chunk, where probabilities are the initial probabilities
> in the case of the first chunk and the optimal-path probabilities in all
> other cases, and paths are the optimal paths leading to the first
> observation of each subsequence.
> * The Reduce function just performs the sequential Viterbi algorithm on
> these data and returns (probabilities, paths).
> * The Driver class runs Map then Reduce for the first chunk, then Map and
> Reduce for the second, and so on, and finally delivers the results
> (probably using another map-reduce combination).
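> As a sketch of the work the Reduce step would do per subsequence, here is
> the log-space Viterbi recursion over one chunk, carrying per-state
> probabilities in from the previous chunk (a hypothetical sketch; the names
> and signatures are made up, not the existing Mahout HMM API):

```java
// Hypothetical sketch of per-chunk Viterbi processing in log space.
// logV[k] on entry is the best log-probability of any path ending in state
// k just before this chunk (the initial log-probabilities for the first
// chunk). Not Mahout code - names are illustrative.
public class ChunkViterbi {

    /**
     * Runs the Viterbi recursion over one chunk of observations.
     * logA[j][k]  - log transition probability from state j to state k
     * logB[k][o]  - log emission probability of observation o in state k
     * backptr[t][k] is filled with the arg-max predecessor of state k at
     * local time t, to be traced back during path reconstruction.
     * Returns the updated per-state log-probabilities after the chunk.
     */
    public static double[] forward(double[] logV, double[][] logA,
                                   double[][] logB, int[] obs,
                                   int[][] backptr) {
        int K = logV.length;
        double[] cur = logV.clone();
        for (int t = 0; t < obs.length; t++) {
            double[] next = new double[K];
            for (int k = 0; k < K; k++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int j = 0; j < K; j++) {
                    double p = cur[j] + logA[j][k];
                    if (p > best) { best = p; arg = j; }
                }
                next[k] = best + logB[k][obs[t]];
                backptr[t][k] = arg;
            }
            cur = next;
        }
        return cur;
    }
}
```

> Working in log space (adding log-probabilities instead of multiplying
> probabilities) avoids underflow on long sequences, which matters here
> because chunks are processed in a long series.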
> It may also be possible to use the Lazy Viterbi algorithm [7, 8] instead of
> the standard Viterbi for chunk processing.
> Now let's focus on all other cases, when the first T chunks contain
> subsequences of nearly the same length and the remaining N-T chunks do not.
> These can be processed using the following technique (which is strategy
> number 2 in [the
> list|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E]):
> First, denote by $V_{t,k}$ the probability that the optimal hidden state
> path goes through hidden state $k$ at moment $t$. We need such
> probabilities to select the most likely (optimal) path, so we need to
> compute $V_{t,k}$ for every $t$ and $k$ (that is part of the standard
> Viterbi algorithm). I omit the sequence index $i$ here, because each
> sequence is a separate task.
> 1) Process each chunk separately. Within chunk n, $t$ lies in the interval
> [L*n, L*(n+1)). To compute $V_{t,k}$ there, we need to know $V_{L*n-1,k1}$
> for each $k1$. But those values belong to the previous chunk (number n-1),
> which is processed separately. We need to obtain the max and arg-max of
> $V_{L*n-1, k1}$ over $k1$.
> 2) Since we don't know the maxes and arg-maxes, let's just assume that the
> optimal path goes through k1=1, then through k1=2, and so on up to K, with
> $V_{L*n-1, k1}$ = 1 (or 0 if we add log-probabilities instead of
> multiplying probabilities) and the arg-max taken as k1 = 1, 2, ..., K in
> turn. Later we can multiply $V_{L*(n+1)-1, k1}$ by the precise probability
> max{ $V_{L*n-1, k1}$ } (or add its log). Thus we get K^2 possibly optimal
> paths instead of K.
> 3) After we process all the chunks, we start the "connecting" phase.
> The first of these N-T chunks can be processed with the usual Viterbi
> recursion, since we know the results for the previous "good" chunk. All the
> others need to be connected to each other, starting by connecting the
> second with the first, then the third with the fourth, and so on. During
> the connection process we throw away all non-optimal paths (now we know
> which are optimal and which are not) and keep just K possibly optimal
> paths.
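> The connecting step, which collapses the K^2 speculative paths of a chunk
> back to K once the left boundary is known, might look like this (again a
> sketch with hypothetical names, not Mahout code):

```java
// Hypothetical sketch of the "connecting" phase. Given the now-known
// per-state values at the end of the previous chunk and this chunk's
// speculative K x K table, keep only the K optimal paths.
public class ChunkConnector {

    /**
     * prevV[k1]    - best log-probability of ending the previous chunk in
     *                state k1 (known after that chunk is resolved)
     * table[k1][k] - this chunk's speculative result for boundary state k1
     * argK1[k]     - filled with the winning boundary state for each k,
     *                needed to splice the partial paths together
     * Returns newV[k], the resolved per-state values at this chunk's end.
     */
    public static double[] connect(double[] prevV, double[][] table,
                                   int[] argK1) {
        int K = prevV.length;
        double[] newV = new double[K];
        for (int k = 0; k < K; k++) {
            double best = Double.NEGATIVE_INFINITY;
            int arg = 0;
            for (int k1 = 0; k1 < K; k1++) {
                double p = prevV[k1] + table[k1][k];
                if (p > best) { best = p; arg = k1; }
            }
            newV[k] = best;
            argK1[k] = arg;
        }
        return newV;
    }
}
```

> Each connection is a K x K max-plus product, so connecting a pair of
> chunks costs O(K^2) per boundary on top of the chunk processing itself.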
> To handle such difficult cases, the Driver class should know the sequence
> lengths, so it can recognize which Map-Reduce combination to use. The
> resulting time complexity for the difficult case is
> O(((T * L / N) / P) * K^2 + (((N - T) * L / N) / P) * K^3). The first term
> of this sum is the time to compute the first T chunks using the simple-case
> approach, and the second term is the time to compute the rest using the
> difficult-case approach. As T -> N, the overall time tends to
> O((M/P) * K^2). T * L / N is roughly the total length of the subsequences
> in the first T chunks, and (N - T) * L / N is the total length of the rest.
> The only problem not discussed here is how to deal with the obtained most
> likely paths. A path's length is equal to the length of its sequence, so
> it's not a good idea to emit optimal paths in the Map stage. They could be
> stored separately in files or in HBase, since they are only required when
> delivering the results to the user.
> This implementation should work (or be compatible) with
> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel
> As you can see, the Viterbi algorithm uses a very common dynamic
> programming approach: it computes the next layer of the computation by
> combining values from the previous one. That is all we need to know to
> implement it in Map-Reduce, and it shares the mentioned difficulties with
> all other such dynamic programming algorithms, so the implementation may be
> highly reusable.
> Timeline:
> 1. 1 week. Explore Mahout internals, the sequential HMM code, and coding
> best practices. Think about the path-storage problem. I'll actually work on
> this stage before GSoC starts, but I leave one week here just to be sure.
> 2. 5 days. Write the chunk division routines.
> 3. 5 days. Write the SimpleMapper class for simple good case of data.
> 4. 1 week. Write the SimpleReducer class for simple good case.
> 5. 1 week. Write the Driver class for simple good case which will use
> SimpleMapper and SimpleReducer.
> 6. 1 week. Collect debug dataset, write tests for the code and find possible
> problems, fix them.
> 7. 10 days. Write the HardReducer class for the difficult case and rewrite
> the Driver class to handle such cases. Run the tests and ensure that
> everything works.
> 8. 10 days. Try to get a large dataset to test performance, scalability,
> etc., and write and run such a test (community help might be needed here).
> If there are any problems, discuss and analyze them, then fix them.
> 9. 1 week. Try to use an optimized Viterbi implementation (e.g. Lazy
> Viterbi) in the chunk-processing phase, and measure the speed improvement,
> especially for difficult cases.
> 10. 1 week. Write documentation for all classes, a description of the
> algorithm in the wiki, and a complete usage example.
> Additional Information:
> The project is related to [this
> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about
> implementing a parallel Baum-Welch training algorithm for HMMs.
> My motivation for this project is that I need such functionality as a
> regular Mahout user, since I often encounter HMMs in my research work. I'm
> also very interested in which kinds or classes of algorithms can be
> implemented efficiently in Map-Reduce; I think this is an important
> question for the community, and such experience is significant.
> One important detail is that I have exams at my university until June 16th.
> After that I will be ready to work on the project.
> References:
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models
> and selected applications in speech
> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> [2] [Adjusted Viterbi Training. A proof of
> concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3]
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> [7] http://www.scribd.com/doc/48568001/Lazy-Viterbi-Slides
> [8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira