Hi! Here is my proposal draft for GSoC. Your comments, questions, etc.
are welcome.

I already wrote to the mailing list about my GSoC idea, which was
implementing parallel versions of HMM algorithms. Since Dhruv
Kumar has taken the Baum-Welch algorithm
(https://issues.apache.org/jira/browse/MAHOUT-627), which is the
training algorithm for HMMs, I'd like to implement the Viterbi
algorithm (the decoding one). If our projects succeed, the HMM
functionality in Mahout will be ready to work in parallel mode and
therefore able to handle large amounts of data for both training and
decoding. HMMs have many applications, such as speech recognition and
generation, natural language processing, and bioinformatics; at the
moment Mahout offers little for graphical models (and as far as I know,
distributed/parallel algorithms for graphical models have not been
studied much), so such a contribution would be helpful for the community.

The Viterbi algorithm is a classic dynamic programming algorithm,
well described in many sources such as [1], [2], [3]. Being popular
and widely needed, the Viterbi algorithm already has parallel versions
that have been the subject of several studies, for example this paper
about a GPU implementation [4]. Some of these papers may contain
useful ideas for a map-reduce implementation.
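For reference, here is a minimal sequential Viterbi sketch in log space.
It is only an illustration of the recurrence to be parallelized; the
class and method names are mine, not Mahout's existing API.

```java
// Minimal sequential Viterbi in log space (illustrative, not Mahout API).
public class ViterbiSketch {

    /** Returns the most likely hidden state sequence for the observations. */
    static int[] viterbi(double[] initial, double[][] transition,
                         double[][] emission, int[] observations) {
        int numStates = initial.length;
        int length = observations.length;
        // delta[t][s] = best log-probability of any path ending in state s at time t
        double[][] delta = new double[length][numStates];
        int[][] backPointer = new int[length][numStates];

        for (int s = 0; s < numStates; s++) {
            delta[0][s] = Math.log(initial[s]) + Math.log(emission[s][observations[0]]);
        }
        for (int t = 1; t < length; t++) {
            for (int s = 0; s < numStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < numStates; p++) {
                    double score = delta[t - 1][p] + Math.log(transition[p][s]);
                    if (score > best) { best = score; bestPrev = p; }
                }
                delta[t][s] = best + Math.log(emission[s][observations[t]]);
                backPointer[t][s] = bestPrev;
            }
        }
        // Backtrack from the best final state.
        int bestLast = 0;
        for (int s = 1; s < numStates; s++) {
            if (delta[length - 1][s] > delta[length - 1][bestLast]) bestLast = s;
        }
        int[] path = new int[length];
        path[length - 1] = bestLast;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = backPointer[t][path[t]];
        }
        return path;
    }
}
```

The inner maximization over previous states at each time step is exactly
the part that parallel versions distribute.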

There are many possible strategies for parallelization (which one is
better is an open question for the project); for example, one may be
the following:
1) divide all input (the sequence of observations) into chunks of some
size
2) mark the first chunk as active
3) calculate optimal paths within the active chunk (in parallel mode)
4) mark the next chunk as active and perform step 3 using the optimal
path probabilities from the previous step
5) repeat until no unprocessed chunks remain
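The steps above can be sketched as follows. This is only a toy
single-machine sketch, assuming log-space probability matrices; within a
chunk the per-state recurrence runs in parallel (here via parallel
streams, whereas the real implementation would be a map-reduce Job), and
chunks are processed in order, each handing its resulting probabilities
to the next. Back-pointers (omitted here) would be stored per chunk for
the final backtracking pass. All names are illustrative.

```java
import java.util.stream.IntStream;

// Toy sketch of the chunked strategy (illustrative, not Mahout API):
// a chunk receives the best log-probabilities per state from the previous
// chunk (step 4), advances through its own observations, and returns the
// updated probabilities for the next chunk.
public class ChunkedViterbiSketch {

    static double[] processChunk(double[] prevDelta, double[][] logTransition,
                                 double[][] logEmission, int[] chunkObservations) {
        double[] delta = prevDelta;
        for (int obs : chunkObservations) {
            final double[] in = delta;
            // Step 3: each target state is scored independently of the others,
            // so this maximization parallelizes cleanly.
            delta = IntStream.range(0, in.length).parallel()
                .mapToDouble(s -> {
                    double best = Double.NEGATIVE_INFINITY;
                    for (int p = 0; p < in.length; p++) {
                        best = Math.max(best, in[p] + logTransition[p][s]);
                    }
                    return best + logEmission[s][obs];
                }).toArray();
        }
        return delta; // handed to the next chunk as its prevDelta
    }
}
```

Note the inherent sequential dependency between chunks: chunk c+1 cannot
start before chunk c finishes, so the parallelism is within a chunk, not
across chunks.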

Project stages:
1. Prepare a list of all reasonable parallelization strategies,
analyze them, and discuss them with the mentor/community.
2. Select the most suitable one and implement it in the Mahout code base
as a Job. The implementation should work with (or be compatible with)
the existing org.apache.mahout.classifier.sequencelearning.hmm.HmmModel
class.
3. Collect a debugging dataset, find possible bugs in the code, and fix
them.
4. Try to get a large dataset to test performance, scalability, etc.,
then write and run such a test (community help might be needed here).
If any problems arise, discuss and analyze them, then fix them.
5. Write documentation for all classes, a description of the algorithm
on the wiki, and a complete usage example.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[2] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[3] http://www.kanungo.com/software/hmmtut.pdf
[4] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903

A timeline will be provided a little later, after I get some feedback.

That's all, so what would you say?
