[
https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Bartunov updated MAHOUT-652:
-----------------------------------
Description:
Proposal Title: Parallel Viterbi algorithm for HMM
Student Name: Sergey Bartunov
Student E-mail: [email protected]
Organization/Project: Apache Mahout
Assigned Mentor:
Proposal Abstract:
The Viterbi algorithm is a decoding algorithm for Hidden Markov Models [1].
It estimates the most likely sequence of hidden states, called the "Viterbi
path", from a given sequence of observed states. The Viterbi algorithm is also
used in Viterbi training, which is less numerically stable than the Baum-Welch
algorithm but faster, and can be adjusted to reach nearly the same accuracy
[2]. Hidden Markov Models are widely used for speech and gesture recognition,
part-of-speech tagging, bioinformatics, etc. At the moment Apache Mahout
contains only sequential HMM functionality, and this project is intended to
extend it by implementing a MapReduce version of the Viterbi algorithm, which
would make Mahout able to evaluate HMMs on large amounts of data in parallel.
Detailed Description:
The Viterbi algorithm is a quite common dynamic programming algorithm and is
well described in many sources such as [3], [4], [5]. Being popular and widely
needed, the Viterbi algorithm already has parallel versions that have been the
subject of several studies, for example this paper about a GPU implementation
[6]. Some of these papers may contain useful ideas for a MapReduce
implementation.
There are several possible parallelization strategies (which one is better is
an open question for the project), and I'll discuss them on the mailing list.
See [this
thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E]
One possible strategy from the above-mentioned thread:
Assume we have $O$, the set of observed state sequences (that's our input
data), where $O_i$ is the i-th sequence with length len($O_i$), $K$ is the
number of hidden states in our model, and $O_{i,t}$ is the observed state of
the i-th sequence at the moment of time $t$.
Let's write all the sequences one above the other. Then take the first L
columns of such a matrix, then the next L columns, and so on. Thus we divide
our input into chunks of size L, so the n-th chunk contains the subsequences
with $t$ in the interval [L*n, L*(n+1)), as illustrated by the sketch below.
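Here is a minimal Java illustration of this chunking step. The class and
method names are illustrative only and do not correspond to any existing
Mahout API:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut one observed sequence into chunks of at most L
// observations; the n-th chunk covers the time interval [L*n, L*(n+1)).
final class SequenceChunking {
  static List<int[]> chunks(int[] sequence, int L) {
    List<int[]> result = new ArrayList<int[]>();
    for (int start = 0; start < sequence.length; start += L) {
      int end = Math.min(start + L, sequence.length);
      int[] chunk = new int[end - start];
      System.arraycopy(sequence, start, chunk, 0, end - start);
      result.add(chunk);
    }
    return result;
  }
}
{code}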
Denote by $V_{t,k}$ the probability that the optimal hidden state path goes
through hidden state $k$ at the moment $t$. We need such probabilities to
select the most likely (I call it "optimal") path, and to do this we need to
know $V_{t,k}$ for every $t$ and $k$.
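For reference, these quantities satisfy the standard Viterbi recurrence. The
symbols $\pi_k$ (initial probabilities), $a_{k_1 k}$ (transition
probabilities) and $b_k(\cdot)$ (emission probabilities) follow the usual
notation of [1] and are not defined elsewhere in this proposal:
$V_{1,k} = \pi_k b_k(O_1)$ and
$V_{t,k} = b_k(O_t) \max_{k_1} ( V_{t-1,k_1} a_{k_1 k} )$ for $t > 1$.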
Let's try to compute $V_{t,k}$ in each chunk independently for each sequence
in the chunk. I omit here the $i$ index of the sequence, because each sequence
is a separate task. $t$ will lie in the [L*n, L*(n+1)) interval, where n is
the index of the chunk.
The first chunk can be processed using the usual sequential Viterbi algorithm,
sketched below.
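A minimal sketch of the sequential Viterbi pass in log space, assuming a
plain-array model representation; Mahout's actual HmmModel classes may expose
the parameters differently:
{code:java}
// logPi[k]    = log initial probability of state k
// logA[k1][k] = log transition probability from k1 to k
// logB[k][o]  = log emission probability of observation o in state k
final class SequentialViterbi {
  static int[] viterbi(double[] logPi, double[][] logA, double[][] logB, int[] obs) {
    int K = logPi.length, T = obs.length;
    double[][] v = new double[T][K];  // v[t][k] = log V_{t,k}
    int[][] back = new int[T][K];     // arg-max back-pointers
    for (int k = 0; k < K; k++) {
      v[0][k] = logPi[k] + logB[k][obs[0]];
    }
    for (int t = 1; t < T; t++) {
      for (int k = 0; k < K; k++) {
        double best = Double.NEGATIVE_INFINITY;
        int bestPrev = 0;
        for (int k1 = 0; k1 < K; k1++) {
          double cand = v[t - 1][k1] + logA[k1][k];
          if (cand > best) { best = cand; bestPrev = k1; }
        }
        v[t][k] = best + logB[k][obs[t]];
        back[t][k] = bestPrev;
      }
    }
    int last = 0;  // backtrack from the most likely final state
    for (int k = 1; k < K; k++) {
      if (v[T - 1][k] > v[T - 1][last]) last = k;
    }
    int[] path = new int[T];
    path[T - 1] = last;
    for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
    return path;
  }
}
{code}
This is the O(T*K^2) dynamic program a single task would run on its chunk; the
difficulty discussed next is the dependency on the previous chunk's boundary
column.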
But to compute, e.g., $V_{L*n,k}$ in any other chunk we need to know
$V_{L*n-1,k_1}$ for each $k_1$, that is, a value from the previous chunk,
which is computed separately. We need it because we want to obtain the max and
arg-max of $V_{L*n-1,k_1}$ over $k_1$.
At this point there are two alternatives (a sketch of the stitching step of
the first one is given after the list):
1) We may compute all possible subpaths in the chunk by assuming that the
arg-max is 1, then that it is 2, 3, and so on up to K. This requires O(L*K^3)
time per chunk. After we get all possible subpaths, we may throw away the
non-optimal ones using the result of processing the first chunk, then the
result of the second, and so on. Time complexity is O((M/P) * K^3) for the
overall process.
2) We may not process the chunks separately and in parallel: let's just wait
until the first chunk is finished, then use its results to compute the second
chunk, and so on. This requires about O((M/P) * K^2) time. This strategy will
work very well if we have many sequences of nearly the same length;
workarounds for cases when the lengths differ greatly are to be investigated.
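Below is a hedged sketch of the selection ("stitching") step of alternative 1,
as I read it: every chunk is solved K times, once per assumed arg-max state at
time L*n-1, and once the previous chunk is finished its boundary column tells
us which candidate to keep. All names are illustrative, not Mahout code:
{code:java}
// One speculative result per assumed boundary state k1.
final class SpeculativeStitch {
  static final class Candidate {
    final int assumedState;  // the assumed arg-max state at time L*n - 1
    final int[] path;        // in-chunk Viterbi path under that assumption
    Candidate(int assumedState, int[] path) {
      this.assumedState = assumedState;
      this.path = path;
    }
  }

  /** Keep the candidate whose assumption matches the previous chunk's actual arg-max. */
  static int[] stitch(double[] prevBoundaryLogV, Candidate[] candidates) {
    int argMax = 0;
    for (int k1 = 1; k1 < prevBoundaryLogV.length; k1++) {
      if (prevBoundaryLogV[k1] > prevBoundaryLogV[argMax]) argMax = k1;
    }
    for (Candidate c : candidates) {
      if (c.assumedState == argMax) return c.path;  // throw away the rest
    }
    throw new IllegalStateException("no candidate for state " + argMax);
  }
}
{code}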
Probably it's possible to use Fast/Lazy Viterbi or some Monte Carlo technique;
I'll think about it in the near future.
Timeline:
1. 2 weeks. Prepare a list of all reasonable parallelization strategies,
probably getting some ideas from the papers, analyze them, and discuss them
with the mentor/community. Learn Mahout internals and best practices. This
stage will probably be shortened by parallel discussions on the mailing list
before GSoC actually begins.
2. 4 weeks. Select the most suitable strategy and implement it in the Mahout
code base as a Job. The implementation should work (or be compatible) with
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
3. 2 weeks. Collect a debugging dataset, write tests for the code, find
possible problems, and fix them.
4. 2 weeks. Try to get a large dataset to test performance, scalability, etc.,
and write and perform such a test (community help might be needed here). If
there are any problems, discuss and analyze them, then fix them.
5. 1 week. Write documentation for all classes, a description of the algorithm
in the wiki, and a complete usage example.
Additional Information:
The project is related to [this
proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about
implementing a parallel Baum-Welch training algorithm for HMMs.
One important detail is that I have university exams until June 16th; after
that I will be ready to work on the project.
References:
[1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models
and selected applications in speech
recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
[2] [Adjusted Viterbi Training. A proof of
concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
[3]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5] http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
> Key: MAHOUT-652
> URL: https://issues.apache.org/jira/browse/MAHOUT-652
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.4
> Reporter: Sergey Bartunov
> Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11,
> sequence-learning
>
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira