[ 
https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Bartunov updated MAHOUT-652:
-----------------------------------

    Description: 
Proposal Title: Parallel Viterbi algorithm for HMM

Student Name: Sergey Bartunov

Student E-mail: [email protected]

 

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract:

The Viterbi algorithm is a decoding algorithm for Hidden Markov Models (HMMs) [1]. 
It finds the most likely sequence of hidden states, called the "Viterbi path", 
for a given sequence of observed states. The Viterbi algorithm is also used in 
Viterbi training, which is less numerically stable than the Baum-Welch algorithm 
but faster, and which can be adjusted to reach nearly the same accuracy [2]. 
Hidden Markov Models are widely used for speech and gesture recognition, 
part-of-speech tagging, bioinformatics, etc. At the moment Apache Mahout contains 
only sequential HMM functionality, and this project is intended to extend it by 
implementing a Map-Reduce version of the Viterbi algorithm, which would make 
Mahout able to apply HMMs to large amounts of data in parallel. 

Detailed Description:

The Viterbi algorithm is a classic dynamic programming algorithm; it is well 
described in many sources, such as [3], [4], [5]. Being popular and widely 
needed, the Viterbi algorithm already has parallel versions that have been the 
subject of several studies, for example a paper about a GPU implementation [6]. 
Some of these papers may contain useful ideas for a map-reduce implementation.

There are several possible strategies for parallelization (which one is
better is an open question for the project) that I explored and posted to the 
mailing list; see [this 
thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E].

The most reasonable one is as follows.
Assume we have O, the set of observed state sequences (our input data); 
$O_i$ is the i-th sequence, with length 
len($O_i$); K is the number of hidden states in our model; $O_{i,t}$ is the 
observed state of the i-th sequence at time t. Let N be the 
maximum of len($O_i$).

Let's write all the sequences one above another. We get a matrix whose rows 
are the input sequences. Take the first L columns of this matrix and call 
them the "first chunk", then the next L columns, and so on. Thus we divide our 
input into chunks of width at most L, where L is a parameter of the algorithm; 
the n-th chunk contains the observations with $t$ in the interval [L*n, 
L*(n+1)). 
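
For concreteness, here is a minimal sketch of the chunk-division bookkeeping 
(class and method names are hypothetical, not existing Mahout code):

{code:java}
/** Chunk-division bookkeeping: maps a chunk index to its time-step interval
 *  (hypothetical helper, not existing Mahout code). */
public final class ChunkDivider {
  private ChunkDivider() {}

  /** Returns the half-open interval [start, end) of time steps covered by chunk n. */
  public static int[] chunkBounds(int n, int L, int sequenceLength) {
    int start = n * L;
    int end = Math.min(start + L, sequenceLength);
    return new int[] {start, end};
  }

  public static void main(String[] args) {
    // A sequence of length 250 with L = 100 yields chunks [0,100), [100,200), [200,250).
    for (int n = 0; n * 100 < 250; n++) {
      int[] b = chunkBounds(n, 100, 250);
      System.out.println("chunk " + n + ": [" + b[0] + ", " + b[1] + ")");
    }
  }
}
{code}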

The key idea is to process the chunks one by one with the standard sequential 
Viterbi algorithm (the existing Mahout code could be reused here), handling all 
sequences within a chunk in parallel and using the results of the previous 
chunk's processing.

This will require about O((M/P) * K^2) time, where M is the sum of all sequence 
lengths, provided most input sequences have nearly the same length (the good 
case); P here is the number of "cores" / computational nodes. This time 
complexity means that the strategy scales well. 

Here is Map-Reduce scheme for the good case:

* Map emits a (subsequence, probabilities, paths) tuple for each subsequence in 
the chunk, where the probabilities are the initial probabilities in the case of 
the first chunk and the optimal-path probabilities in all other cases, and the 
paths are the optimal paths leading to the first observation of each subsequence. 
* Reduce simply runs the sequential Viterbi algorithm on these data and 
returns (probabilities, paths); a sketch of this chunk-level step follows the list.
* The Driver class runs Map then Reduce for the first chunk, then Map and 
Reduce for the second, and so on, and finally delivers the results (probably 
using another map-reduce combination).
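
Here is a minimal sketch of the chunk-level Viterbi step performed in Reduce, 
assuming log-space probabilities; all class and parameter names are hypothetical, 
an illustration under those assumptions rather than the final implementation:

{code:java}
/** Sequential Viterbi over one chunk, seeded with log-probabilities carried
 *  over from the previous chunk (hypothetical sketch, log-space throughout). */
public final class ChunkViterbi {
  private ChunkViterbi() {}

  /**
   * @param logA         K x K transition log-probabilities
   * @param logB         K x S emission log-probabilities
   * @param obs          observed states of this chunk
   * @param entry        log initial distribution for the first chunk, otherwise
   *                     log-probabilities at the last step of the previous chunk
   * @param firstChunk   whether this is the first chunk of the sequence
   * @param backpointers output: backpointers[t][k] = best predecessor of state k
   *                     at step t within this chunk (-1 at the chunk boundary)
   * @return log-probabilities of each hidden state after the last observation,
   *         to be passed to the next chunk as its entry vector
   */
  public static double[] run(double[][] logA, double[][] logB, int[] obs,
                             double[] entry, boolean firstChunk,
                             int[][] backpointers) {
    int K = entry.length;
    double[] v = new double[K];
    for (int k = 0; k < K; k++) {
      if (firstChunk) {
        v[k] = entry[k] + logB[k][obs[0]];  // entry = log pi
        backpointers[0][k] = -1;
      } else {
        // Transition from the last column of the previous chunk.
        double best = Double.NEGATIVE_INFINITY;
        int arg = -1;
        for (int k1 = 0; k1 < K; k1++) {
          double cand = entry[k1] + logA[k1][k];
          if (cand > best) { best = cand; arg = k1; }
        }
        v[k] = best + logB[k][obs[0]];
        backpointers[0][k] = arg;
      }
    }
    for (int t = 1; t < obs.length; t++) {
      double[] next = new double[K];
      for (int k = 0; k < K; k++) {
        double best = Double.NEGATIVE_INFINITY;
        int arg = -1;
        for (int k1 = 0; k1 < K; k1++) {
          double cand = v[k1] + logA[k1][k];
          if (cand > best) { best = cand; arg = k1; }
        }
        next[k] = best + logB[k][obs[t]];
        backpointers[t][k] = arg;
      }
      v = next;
    }
    return v;
  }
}
{code}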

It may also be possible to use lazy Viterbi [7, 8] instead of the standard 
Viterbi for chunk processing.

Now let's focus on all other cases, where the first T chunks contain 
subsequences of nearly the same length and the remaining chunks do not. These 
can be processed using the following technique (which is strategy number 2 in [the 
list|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%[email protected]%3E]):

First, denote by $V_{t,k}$ the probability that the optimal hidden state path 
goes through hidden state $k$ at time $t$. We need these probabilities to 
select the most likely (optimal) path, so we need to compute $V_{t,k}$ for 
every $t$ and $k$ (that is part of standard Viterbi). I omit the sequence 
index $i$ here because each sequence is a separate task.
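
For reference, these quantities satisfy the standard textbook recurrence (cf. 
[1]), where $A_{k1,k}$ is the transition probability from state $k1$ to state 
$k$, $B_{k,o}$ is the probability of emitting observation $o$ in state $k$, and 
$\pi_k$ is the initial distribution:

$V_{1,k} = \pi_k \cdot B_{k,O_1}$
$V_{t,k} = B_{k,O_t} \cdot \max_{k1} ( V_{t-1,k1} \cdot A_{k1,k} )$ for $t > 1$,

with the maximizing $k1$ stored as a backpointer so the optimal path can be 
recovered at the end.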

1) Process each chunk separately; within chunk n, $t$ lies in the interval 
[L*n, L*(n+1)). To compute $V_{t,k}$ inside the chunk we actually need to know 
$V_{L*n-1,k1}$ for each $k1$, but those values belong to the previous chunk 
(number n-1), which is processed separately; in particular, we need the max and 
arg-max over $k1$ of $V_{L*n-1,k1}$. 
2) Since we don't know these maxima and arg-maxima yet, we simply assume, in 
turn for each entry state k1 = 1, 2, ..., K, that the optimal path goes through 
k1, and set $V_{L*n-1,k1}$ = 1 (or 0 if we add log-probabilities instead of 
multiplying probabilities). Later we can multiply the resulting 
$V_{L*(n+1)-1,k}$ by the precise entry probability $V_{L*n-1,k1}$ (or add its 
log). Thus we get K^2 possibly optimal paths instead of K.
3) After we have processed all the chunks, we start the "connecting" phase. 

The first of these remaining chunks can be processed with the usual Viterbi 
procedure, since we know the results for the previous "good" chunk. All the 
others need to be connected to each other, starting by connecting the second 
to the first, then the third with the fourth, and so on. During the connection 
process we throw away all non-optimal paths (now we know which are optimal and 
which are not) and keep just K possibly optimal paths; see the sketch after 
this paragraph.
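
Here is a minimal sketch of a single connection step under the scheme above 
(all names hypothetical): given the exit scores of chunk n-1 and the K^2 
hypothesis scores of chunk n, it keeps the K surviving paths.

{code:java}
/** One "connecting" step: for each exit state of chunk n, pick the entry
 *  hypothesis that scores best together with chunk n-1's exit scores
 *  (hypothetical sketch, log-space). */
public final class ChunkConnector {
  private ChunkConnector() {}

  /**
   * @param prevExit      log-probabilities of each state at the end of chunk n-1
   * @param hypothesisEnd hypothesisEnd[k1][k] = log-probability at the end of
   *                      chunk n assuming the path entered through state k1
   * @param chosenEntry   output: chosenEntry[k] = entry state selected for exit state k
   * @return corrected log-probabilities at the end of chunk n
   */
  public static double[] connect(double[] prevExit, double[][] hypothesisEnd,
                                 int[] chosenEntry) {
    int K = prevExit.length;
    double[] exit = new double[K];
    for (int k = 0; k < K; k++) {
      double best = Double.NEGATIVE_INFINITY;
      int arg = -1;
      for (int k1 = 0; k1 < K; k1++) {
        // Each hypothesis was computed with entry score 0 (log 1), so the true
        // score is obtained by adding the actual exit score of chunk n-1.
        double cand = prevExit[k1] + hypothesisEnd[k1][k];
        if (cand > best) { best = cand; arg = k1; }
      }
      exit[k] = best;
      chosenEntry[k] = arg;  // K paths survive out of the K^2 hypotheses
    }
    return exit;
  }
}
{code}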

To handle such difficult cases, the Driver class should know the sequence 
lengths so that it can recognize which Map-Reduce combination to use. Writing C 
for the total number of chunks (roughly N/L), the resulting time complexity for 
the difficult case is about 
O(((T/C) * M / P) * K^2 + (((C-T)/C) * M / P) * K^3). The first term of this 
sum is the time for computing the first T chunks using the simple-case 
approach, and the second term is the time for computing the rest using the 
difficult-case approach: (T/C) * M is approximately the total length of the 
subsequences in the first T chunks and ((C-T)/C) * M is the total length of 
the rest. As T -> C, the overall time tends to O((M/P) * K^2).

The only problem not discussed here is how to deal with the resulting most 
likely paths. A path is as long as its sequence, so it's not a good idea to 
emit optimal paths in the Map stage. They could be stored separately in files 
or in HBase, since they are needed only when delivering the results to the 
user; a possible file-based layout is sketched below.
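
As one option, the paths could go into a Hadoop SequenceFile keyed by sequence 
id. A hedged sketch (class name and on-disk layout are assumptions, not a 
settled design):

{code:java}
// Hypothetical sketch of storing decoded paths outside the Map output stream,
// in a Hadoop SequenceFile keyed by sequence id (layout assumed for illustration).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public final class PathStore {
  private PathStore() {}

  public static void writePath(Configuration conf, Path file, int seqId, int[] path)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, file, IntWritable.class, Text.class);
    try {
      // Encode the hidden-state path as a space-separated string.
      StringBuilder sb = new StringBuilder();
      for (int state : path) {
        sb.append(state).append(' ');
      }
      writer.append(new IntWritable(seqId), new Text(sb.toString().trim()));
    } finally {
      writer.close();
    }
  }
}
{code}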

This implementation should work with (or at least be compatible with) 
org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
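
In particular, the parallel job would be expected to reproduce the output of 
the existing sequential decoder. A hedged sketch of that baseline, assuming 
the sequential API exposes a decode call roughly like Mahout's HmmEvaluator 
(to be verified against the actual 0.4/0.5 code):

{code:java}
// Hedged sketch: interoperation with the existing sequential classes. The
// HmmEvaluator.decode signature is assumed from Mahout's sequential HMM code.
import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;

public final class SequentialBaseline {
  private SequentialBaseline() {}

  /** The parallel job should produce the same Viterbi path as this call. */
  public static int[] decodeSequentially(HmmModel model, int[] observations) {
    return HmmEvaluator.decode(model, observations, true);
  }
}
{code}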

As you can see, the Viterbi algorithm uses a very common dynamic programming 
pattern: it computes the next layer of the computation by combining values from 
the previous one. That is all we need to know to implement it in Map-Reduce, 
and it shares the difficulties mentioned above with all other such dynamic 
programming algorithms, so the implementation may be highly reusable.

Timeline:
1. 1 week. Explore Mahout internals, the sequential HMM code, and coding best 
practices. Think about the path-storage problem. I'll actually work on this 
stage before GSoC starts, but I reserve 1 week here just to be sure.
2. 5 days. Write the chunk-division routines. 
3. 5 days. Write the SimpleMapper class for the simple (good) case of data.
4. 1 week. Write the SimpleReducer class for the simple case.
5. 1 week. Write the Driver class for the simple case, which will use 
SimpleMapper and SimpleReducer. 
6. 1 week. Collect a debug dataset, write tests for the code, find possible 
problems, and fix them.
7. 10 days. Write the HardReducer class for the difficult case and rewrite the 
Driver class to handle such cases. Perform tests and ensure that everything works.
8. 10 days. Try to get a large dataset to test performance, scalability, etc., 
and write and perform such a test (community help might be needed here).
If there are any problems, discuss and analyze them, then fix.
9. 1 week. Try an optimized Viterbi implementation (e.g. lazy Viterbi) in the 
chunk-processing phase and measure the speed improvement, especially for 
difficult cases.
10. 1 week. Write documentation for all classes, a description of the algorithm 
in the wiki, and a complete usage example.

Additional Information:

The project is related to [this 
proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about 
implementing a parallel Baum-Welch training algorithm for HMMs.

My motivation for this project is that I need such functionality as a regular 
Mahout user, since I often encounter HMMs in my research work. I'm also very 
interested in which kinds or classes of algorithms can be implemented 
efficiently in Map-Reduce; I think this is an important question for the 
community, and such experience is valuable.

One important detail is that I have university exams until June 16th. After 
that I will be ready to work on the project.

References:

[1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models 
and selected applications in speech 
recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. 
Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626. 
[2] [Adjusted Viterbi Training. A proof of 
concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
[3] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5] http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
[7] http://www.scribd.com/doc/48568001/Lazy-Viterbi-Slides
[8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367


> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Sergey Bartunov
>              Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11, 
> sequence-learning
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
