Hi! Here is my proposal draft for GSoC. Your comments, questions, etc. are welcome.
I already wrote to the mailing list about my GSoC idea, which is implementing parallel versions of HMM algorithms. Since Dhruv Kumar has taken the Baum-Welch algorithm (https://issues.apache.org/jira/browse/MAHOUT-627), which is the training algorithm for HMMs, I'd like to implement the Viterbi algorithm (the decoding one). If our projects succeed, the HMM functionality in Mahout will be able to work in parallel mode and therefore handle large amounts of data for both training and decoding.

HMMs have many applications, such as speech recognition and generation, natural language processing, bioinformatics, etc. At the moment Mahout offers little for graphical models (and, as far as I know, distributed/parallel algorithms for graphical models are still not studied much), so such a contribution would be helpful for the community.

The Viterbi algorithm is a quite common dynamic programming algorithm; it is well described in many sources, such as [1], [2], [3]. Being popular and needed, the Viterbi algorithm already has parallel versions, which have been the subject of several studies, for example this paper about a GPU implementation [4]. Some of these papers may contain useful ideas for a map-reduce implementation.

There are many possible parallelization strategies (which one is best is an open question for the project). One of them may be the following:
1) divide all input (the sequence of observations) into chunks of some size
2) set the active chunk to 1
3) calculate optimal paths in the active chunk (in parallel mode)
4) set the next chunk as active and perform step 3 using the optimal path probabilities from the previous step
5) repeat until there are no unprocessed chunks

Project stages:
1. Prepare a list of all reasonable parallelization strategies, analyze them and discuss them with the mentor/community.
2. Select the most suitable one and implement it in Mahout code as a Job.
The implementation should work (or be compatible) with the org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class.
3. Collect a debug dataset, find possible bugs in the code and fix them.
4. Try to get a large dataset to test performance, scalability, etc., then write and perform such a test (here community help might be needed). If there are any problems, discuss and analyze them, then fix them.
5. Write documentation for all classes, a description of the algorithm in the wiki and a complete usage example.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[2] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[3] http://www.kanungo.com/software/hmmtut.pdf
[4] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903

A timeline will be provided a little bit later, after I get some feedback. That's all, so what would you say?
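P.S. To make the chunked strategy above more concrete, here is a rough sketch in plain Java. All class and variable names are mine, nothing here is Mahout API, and it only simulates the per-chunk passes sequentially; in the actual map-reduce Job the inner per-state maximization within a chunk is the part that would run in parallel, while the final probabilities of each chunk seed the next one (step 4 above):

```java
import java.util.Arrays;

/** Sketch of the chunked Viterbi strategy (illustrative only). */
public class ChunkedViterbi {

  /** Returns the most likely hidden state sequence for the observations. */
  static int[] decode(double[] initial, double[][] trans, double[][] emit,
                      int[] observations, int chunkSize) {
    int n = trans.length, T = observations.length;
    double[] delta = new double[n];  // log-probability of the best path ending in each state
    int[][] psi = new int[T][n];     // back-pointers for path recovery
    for (int j = 0; j < n; j++)
      delta[j] = Math.log(initial[j]) + Math.log(emit[j][observations[0]]);

    // steps 1-5 of the proposal: walk over chunks, carrying delta across boundaries
    for (int start = 1; start < T; start += chunkSize) {
      int end = Math.min(start + chunkSize, T);
      for (int t = start; t < end; t++) {
        double[] next = new double[n];
        for (int j = 0; j < n; j++) {  // independent per state, hence parallelizable
          double best = Double.NEGATIVE_INFINITY;
          int arg = 0;
          for (int i = 0; i < n; i++) {
            double p = delta[i] + Math.log(trans[i][j]);
            if (p > best) { best = p; arg = i; }
          }
          next[j] = best + Math.log(emit[j][observations[t]]);
          psi[t][j] = arg;
        }
        delta = next;
      }
    }
    // backtrack from the globally best final state
    int[] path = new int[T];
    int last = 0;
    for (int j = 1; j < n; j++) if (delta[j] > delta[last]) last = j;
    path[T - 1] = last;
    for (int t = T - 1; t > 0; t--) path[t - 1] = psi[t][path[t]];
    return path;
  }

  public static void main(String[] args) {
    // toy 2-state model: state 0 tends to emit symbol 0, state 1 symbol 1
    double[] pi = {0.5, 0.5};
    double[][] a = {{0.8, 0.2}, {0.2, 0.8}};
    double[][] b = {{0.9, 0.1}, {0.1, 0.9}};
    int[] obs = {0, 0, 1, 1, 1, 0};
    System.out.println(Arrays.toString(decode(pi, a, b, obs, 2)));
    // prints [0, 0, 1, 1, 1, 0]
  }
}
```

Note that the chunk boundaries do not change the result here (delta is carried across them exactly); the point of the chunking is only to mark where the parallel passes would be cut in the distributed version.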
