This post is related to https://issues.apache.org/jira/browse/MAHOUT-652
As I mentioned before, I have been thinking about possible parallelization strategies, and here is what I have at the moment. NB: I'm from Russia and my English is far from ideal; I'll try to explain my thoughts as clearly as I can, but if you misunderstand something, please just ask.

The standard sequential Viterbi algorithm is well described in many sources (for example, on Wikipedia: http://en.wikipedia.org/wiki/Viterbi_algorithm), so I won't explain it here. Here is what I found about possible parallelization of this algorithm. Suppose we have D, the set of observed sequences, an HMM with known parameters, K hidden states, and N_i, the length of the i-th sequence.

1. The first idea that comes to mind is to move through a sequence and parallelize the computation of the optimal paths to the next step, that is, to parallelize over the hidden states. In the terms of the Wikipedia article, that is parallelizing the computation of $V_{t,k}$ over $k$. Since the number of states is usually not large (for the part-of-speech tagging task it is about 6 for English), such parallelization is not reasonable in the map-reduce paradigm.

2. Then we could divide all input sequences into chunks and process the chunks independently. The first chunk of each sequence is computed as usual; the other chunks cannot be processed so easily. Divide a given sequence into chunks of size L, and let s[m] be chunk number m (starting from zero). We need to compute $V_{t,k}$ for every t in [L*m, L*(m+1)) and every k. But to compute $V_{L*m,k}$ we need to know $V_{L*m-1,k_1}$ for each $k_1$, that is, a value from the previous chunk, which is computed independently. Actually, we do not need to know the exact value of $V_{L*m-1,k_1}$ to get the optimal path; it is enough to know which $k_1$ maximizes the equation.
But we do not even know that, so we can compute an optimal path for each candidate $k_1$ by pretending that this $k_1$ is the argmax and that $V_{L*m-1,k_1} = 1$ (or 0 if we add log-probabilities instead of multiplying). Later, in a "connecting" step, we just have to multiply $V_{L*(m+1)-1,k}$ by the actual value of the argmax term (or add it) and obtain the optimal sub-path. This step takes O(L*K^3) time per chunk. After all chunks have been processed, we know the optimal paths in the first chunk exactly, so we can easily throw away all the certainly non-optimal paths in the second chunk, then in the third, and so on. In parallel mode this takes O((M/P) * K^3) time versus O(M * K^2) for sequential Viterbi, where P is the number of "cores" and M is the sum of all the N_i. It is a pity that K appears cubed, but since K is usually small and P may be much greater than K, this still gives a speed-up. At least it scales.

3. Or we could just process the sequences themselves in parallel, one sequence per task, for O((M/P) * K^2), without "connecting" chunks to each other. But this works well only if we have many sequences of about the same length. If not, well, I need to think more about this.

That's not all, but it is all for today.
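For reference, here is the sequential baseline all the estimates above compare against: a minimal log-space Viterbi sketch in Python/NumPy. The toy two-state HMM at the bottom is just my own illustration, not from the task.

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Standard sequential Viterbi in log space.

    log_A[i, j] -- transition log-probability i -> j
    log_B[i, o] -- log-probability that state i emits symbol o
    log_pi[i]   -- initial log-probability of state i
    Returns the most likely hidden-state sequence for obs."""
    K, N = log_A.shape[0], len(obs)
    V = np.empty((N, K))                 # V[t, k]: best log-score ending in k at t
    back = np.zeros((N, K), dtype=int)   # back[t, k]: argmax predecessor state
    V[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, N):
        cand = V[t - 1][:, None] + log_A              # cand[k1, k]
        back[t] = np.argmax(cand, axis=0)
        V[t] = cand[back[t], np.arange(K)] + log_B[:, obs[t]]
    path = [int(np.argmax(V[-1]))]                    # backtrack
    for t in range(N - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state HMM: state 0 prefers symbol 0, state 1 prefers symbol 1.
log_A = np.log([[0.8, 0.2], [0.3, 0.7]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
log_pi = np.log([0.6, 0.4])
print(viterbi([0, 0, 1, 1], log_A, log_B, log_pi))    # [0, 0, 1, 1]
```

The inner O(K^2) step over `cand` is the work that idea 1 would split over states, and the `back` pointers are what the chunked variant in idea 2 has to reconstruct per candidate predecessor.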
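And here is a rough sketch of idea 2 as I understand it: each chunk except the first is decoded K times, once per assumed predecessor state $k_1$ with its score fixed at 0 in log space, and a cheap sequential "connecting" pass then picks the right sub-paths. All names here are mine, not Mahout API.

```python
import numpy as np

def chunk_decode(chunk, log_A, log_B, entry=None, log_pi=None):
    """Viterbi over one chunk. If `entry` is given, the chunk is scored as if
    the previous chunk ended in state `entry` with log-score 0, as in the
    post. Returns (end_scores, paths) where paths[k] is the best path through
    the chunk ending in state k and end_scores[k] is its log-score."""
    K, N = log_A.shape[0], len(chunk)
    V = np.empty((N, K))
    back = np.zeros((N, K), dtype=int)
    V[0] = (log_pi if entry is None else log_A[entry]) + log_B[:, chunk[0]]
    for t in range(1, N):
        cand = V[t - 1][:, None] + log_A
        back[t] = np.argmax(cand, axis=0)
        V[t] = cand[back[t], np.arange(K)] + log_B[:, chunk[t]]
    paths = []
    for k in range(K):                       # one backtrack per end state
        p = [k]
        for t in range(N - 1, 0, -1):
            p.append(int(back[t, p[-1]]))
        paths.append(p[::-1])
    return V[-1], paths

def chunked_viterbi(obs, log_A, log_B, log_pi, L):
    K = log_A.shape[0]
    chunks = [obs[i:i + L] for i in range(0, len(obs), L)]
    # "Map" phase: every chunk is independent, so this loop could run on
    # separate workers. Chunk m > 0 is decoded K times, once per possible
    # predecessor state k1 -- O(L * K^3) per chunk.
    first = chunk_decode(chunks[0], log_A, log_B, log_pi=log_pi)
    rest = []
    for chunk in chunks[1:]:
        scores = np.empty((K, K))            # scores[k1, k_end]
        paths = []
        for k1 in range(K):
            s, p = chunk_decode(chunk, log_A, log_B, entry=k1)
            scores[k1] = s
            paths.append(p)
        rest.append((scores, paths))
    # "Connect" phase: sequential, O(K^2) per chunk boundary. Add in (log
    # space) the actual score of the chosen predecessor state.
    best = first[0]
    choice = []
    for scores, _ in rest:
        total = best[:, None] + scores
        choice.append(np.argmax(total, axis=0))
        best = np.max(total, axis=0)
    # Backtrack over chunk boundaries and stitch the sub-paths together.
    k = int(np.argmax(best))
    pieces = []
    for m in range(len(rest) - 1, -1, -1):
        k1 = int(choice[m][k])
        pieces.append(rest[m][1][k1][k])
        k = k1
    pieces.append(first[1][k])
    return [s for piece in reversed(pieces) for s in piece]

# Same toy two-state HMM as the sequential sketch, my own illustration.
log_A = np.log([[0.8, 0.2], [0.3, 0.7]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
log_pi = np.log([0.6, 0.4])
obs = [0, 0, 1, 1, 0, 1, 1, 0]
print(chunked_viterbi(obs, log_A, log_B, log_pi, L=3))
```

The map phase costs O(L*K^3) per chunk and is embarrassingly parallel; the connect phase is O(K^2) per chunk boundary, i.e. O((M/L)*K^2) sequential work overall, which matches the estimate in point 2.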
