[ https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Bartunov updated MAHOUT-652:
-----------------------------------

    Description: 
Proposal Title: Parallel Viterbi algorithm for HMM

Student Name: Sergey Bartunov

Student E-mail: [email protected]

 

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract:

The Viterbi algorithm is a decoding algorithm for Hidden Markov Models [1]. 
It estimates the most likely sequence of hidden states, called the "Viterbi path", 
from a given sequence of observed states. The Viterbi algorithm is also used in 
Viterbi training, which is less numerically stable than the Baum-Welch algorithm 
but faster, and can be adjusted to reach nearly the same accuracy [2]. Hidden 
Markov Models are widely used for speech and gesture recognition, part-of-speech 
tagging, bioinformatics, etc. At the moment Apache Mahout contains only 
sequential HMM functionality, and this project is intended to extend it by 
implementing a Map-Reduce version of the Viterbi algorithm, which would allow 
Mahout to evaluate HMMs on large amounts of data in parallel. 

Detailed Description:

The Viterbi algorithm is a fairly common dynamic programming algorithm and is 
well described in many sources, such as [3], [4], [5]. Being popular and widely 
needed, the Viterbi algorithm already has parallel versions that have been the 
subject of several studies, for example this paper about a GPU implementation [6]. 
Some of these papers may contain useful ideas for a map-reduce implementation.

There are several possible strategies for parallelization (which one is
better is an open question for the project), and I'll discuss them on the 
mailing list.
See [this 
thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E]

One possible strategy from the above-mentioned thread:
Assume we have O, the set of observed state sequences (that's our input data); 
$O_i$ is the i-th sequence, with length len($O_i$), and K is the number of 
hidden states in our model. $O_{i,t}$ is the observed state of the i-th sequence 
at time t.

Let's write all the sequences one above the other. Then take the first L columns 
of such a matrix, then the next L columns, and so on. Thus we divide our input 
into chunks of width L, so the n-th chunk contains the subsequences with $t$ in 
the interval [L*n, L*(n+1)).
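
Purely as an illustration (the helper below and its names are my own, not part 
of the proposal), the chunking amounts to slicing each observed sequence into 
pieces of width L, so that piece n covers $t$ in [L*n, L*(n+1)):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkingSketch {
  // Slice one observed sequence into pieces of width chunkWidth (= L);
  // piece n covers the time interval [L*n, L*(n+1)).
  static List<int[]> chunk(int[] sequence, int chunkWidth) {
    List<int[]> chunks = new ArrayList<>();
    for (int start = 0; start < sequence.length; start += chunkWidth) {
      int end = Math.min(start + chunkWidth, sequence.length);
      chunks.add(Arrays.copyOfRange(sequence, start, end));
    }
    return chunks; // chunk n would then be handed to the task processing chunk n
  }
}
{code}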

Denote by $V_{t,k}$ the probability that the optimal hidden state path passes 
through hidden state $k$ at time $t$ (i.e. the probability of the most likely 
partial path that ends in state $k$ at time $t$). We need such probabilities to 
select the most likely (I call it "optimal") path; to do this we need to know 
$V_{t,k}$ for every $t$ and $k$.
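
For reference, these quantities satisfy the standard Viterbi recurrence (writing 
$A$ for the transition matrix, $B$ for the emission matrix and $\pi$ for the 
initial state distribution; this notation is mine, not taken from the thread):

$V_{0,k} = \pi_k \cdot B_{k,O_0}$
$V_{t,k} = B_{k,O_t} \cdot \max_{k'} ( V_{t-1,k'} \cdot A_{k',k} )$ for $t > 0$,

and the Viterbi path itself is recovered by backtracking from $\arg\max_k V_{T-1,k}$, 
where T is the length of the sequence.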

Let's try to compute $V_{t,k}$ in each chunk independently, for each sequence in 
the chunk. I omit the sequence index $i$ here, because each sequence is a 
separate task. $t$ lies in the interval [L*n, L*(n+1)), where n is the number of 
the chunk.

The first chunk can be processed with the usual Viterbi algorithm. 
But to compute, for example, $V_{L*n,k}$ in any other chunk we need to know 
$V_{L*n-1, k1}$ for every k1, that is, values from the previous chunk, which is 
computed separately. We need them because we want to obtain the max and arg-max 
of $V_{L*n-1, k1}$ over k1. 
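
As a minimal sketch of this dependency (plain Java with hypothetical names; the 
real code would be built around 
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel, and log-space 
probabilities are assumed for numerical stability), a forward pass over one 
chunk that consumes the last $V$ column of the previous chunk could look like:

{code:java}
final class ViterbiChunkSketch {
  /**
   * Viterbi forward pass over one chunk of observations.
   * prevColumn holds V_{L*n-1, k1} in log-space for every hidden state k1,
   * i.e. the last column produced by the previous chunk; for the very first
   * chunk it would instead be initialized from the initial state distribution.
   */
  static double[][] forwardPass(double[] prevColumn, int[] observations,
                                double[][] logTransition, double[][] logEmission) {
    int numStates = prevColumn.length;
    double[][] v = new double[observations.length][numStates];
    double[] previous = prevColumn;
    for (int t = 0; t < observations.length; ++t) {
      for (int k = 0; k < numStates; ++k) {
        // max over predecessor states k1 of V_{t-1,k1} + log A_{k1,k}
        double best = Double.NEGATIVE_INFINITY;
        for (int k1 = 0; k1 < numStates; ++k1) {
          best = Math.max(best, previous[k1] + logTransition[k1][k]);
        }
        v[t][k] = best + logEmission[k][observations[t]];
      }
      previous = v[t];
    }
    return v; // the last row is handed to the next chunk as its prevColumn
  }
}
{code}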

At this point there are two alternatives (a sketch of how alternative 1 could 
merge chunk results follows this list):
1) We may compute all possible subpaths within a chunk by assuming that the 
arg-max is 1, then 2, 3 and so on up to K. This requires O(L*K^3) time per 
chunk. After we have all possible subpaths, we can throw away the non-optimal 
ones using the result of the first chunk's processing, then the result of the 
second, and so on. The overall time complexity is O((M/P) * K^3).
2) We may choose not to process the chunks separately and in parallel: just wait 
until the first chunk is finished, then use its results to compute the second 
chunk, and so on. This requires about O((M/P) * K^2) time. This strategy works 
very well if we have many sequences of nearly the same length; workarounds for 
cases where the lengths differ greatly are still to be investigated. 
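
Here is a minimal sketch of the merge step of alternative 1, under the 
assumption that each chunk has already speculatively run Viterbi once per 
possible boundary state k1, producing candidate[k1][k], the best log-score of a 
subpath through the chunk that starts right after boundary state k1 and ends in 
state k (these names are hypothetical, not part of the proposal):

{code:java}
final class ChunkMergeSketch {
  /**
   * Alternative 1: once the true boundary column V_{L*n-1, *} of the previous
   * chunk is known, combine it with this chunk's speculative per-boundary-state
   * candidates and discard the non-optimal subpaths.
   * Returns this chunk's boundary column V_{L*(n+1)-1, *} in log-space.
   */
  static double[] merge(double[] prevBoundaryColumn, double[][] candidate) {
    int numStates = prevBoundaryColumn.length;
    double[] newBoundaryColumn = new double[numStates];
    for (int k = 0; k < numStates; ++k) {
      double best = Double.NEGATIVE_INFINITY;
      for (int k1 = 0; k1 < numStates; ++k1) {
        best = Math.max(best, prevBoundaryColumn[k1] + candidate[k1][k]);
      }
      newBoundaryColumn[k] = best;
    }
    return newBoundaryColumn;
  }
}
{code}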

It may also be possible to use the fast/lazy Viterbi algorithm or some Monte 
Carlo technique; I'll think about this in the near future. 

Timeline:
1. 2 weeks. Prepare a list of all reasonable parallelization strategies, 
possibly drawing ideas from the papers, analyze them and discuss them with the 
mentor/community. Learn Mahout internals and best practices. This stage will 
probably be shortened by parallel discussions on the mailing list before 
GSoC actually begins.
2. 4 weeks. Select the most suitable strategy and implement it in the Mahout 
code base as a Job. The implementation should work (or be compatible) with
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
3. 2 weeks. Collect a debugging dataset, write tests for the code, find possible 
problems and fix them.
4. 2 weeks. Try to get a large dataset to test performance, scalability, etc.,
and write and run such a test (community help might be needed here).
If any problems show up, discuss and analyze them, then fix them.
5. 1 week. Write documentation for all classes, a description of the algorithm 
in the wiki and a complete usage example.

Additional Information:

The project is related to [this 
proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about 
implementing a parallel Baum-Welch training algorithm for HMMs.

One important detail is that I have university exams until June 16th. 
After that I will be ready to work on the project.

References:

[1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models 
and selected applications in speech 
recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. 
Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626. 
[2] [Adjusted Viterbi Training. A proof of 
concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
[3] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5]  http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903



> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Sergey Bartunov
>              Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11, sequence-learning
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
