Hi Sergey,

Sorry for the delay; I've been under a rock for the last few weeks. Please open a JIRA issue for your proposal. Overall, I think it sounds reasonable, but please do think about timelines a bit more when you submit. I'm not totally up on the techniques just yet, so I will need to go do some background reading.
I would also think about whether you can narrow down step 1 of your project stages. Is this something you can do up front a bit more? I'd hate to see that bleed into weeks and weeks of the start of the project.

Cheers,
Grant

On Mar 28, 2011, at 8:46 AM, Sergey Bartunov wrote:

> Hi! Here is my proposal draft for GSoC. Your comments, questions, etc.
> are welcome.
>
> I already wrote to the mailing list about my GSoC idea, which was
> implementing parallel versions of HMM algorithms, and since Dhruv
> Kumar has taken the Baum-Welch algorithm
> (https://issues.apache.org/jira/browse/MAHOUT-627), which is the
> training algorithm for HMMs, I'd like to implement the Viterbi
> algorithm (the evaluation one). If our projects succeed, the HMM
> functionality in Mahout will be ready to work in parallel mode and
> therefore to handle large amounts of data for both training and
> evaluation. Since HMMs have many applications such as speech
> recognition and generation, natural language processing,
> bioinformatics, etc., and at the moment Mahout is somewhat lacking in
> graphical models (and as far as I know, distributed/parallel
> algorithms for graphical models are still not studied much), such a
> contribution would be helpful for the community.
>
> The Viterbi algorithm is a common dynamic programming algorithm; it
> is well described in many sources such as [1], [2], [3]. Being
> popular and widely needed, the Viterbi algorithm already has parallel
> versions, which have been the subject of several studies, for example
> a paper on a GPU implementation [4]. Some of these papers may contain
> useful ideas for a map-reduce implementation.
>
> There are many possible strategies for parallelization (which one is
> better is an open question for the project); for example, one may be
> the following (see the code sketch below):
> 1) divide the whole input (the sequence of observations) into chunks
> of some size
> 2) set the first chunk as active
> 3) calculate optimal paths in the active chunk (in parallel mode)
> 4) set the next chunk as active and perform step 3 using the optimal
> path probabilities from the previous step
> 5) repeat until no unlabeled chunks remain
>
> Project stages:
> 1. Prepare a list of all reasonable parallelization strategies,
> analyze them, and discuss them with the mentor/community.
> 2. Select the most suitable one and implement it in Mahout as a Job.
> The implementation should work (or be compatible) with the
> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class.
> 3. Collect a debug dataset, find possible bugs in the code, and fix
> them.
> 4. Try to get a large dataset to test performance, scalability, etc.;
> write and run such a test (community help might be needed here). If
> there are any problems, discuss and analyze them, then fix them.
> 5. Write documentation for all classes, a description of the
> algorithm in the wiki, and a complete usage example.
>
> 1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> 2. http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> 3. http://www.kanungo.com/software/hmmtut.pdf
> 4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>
> A timeline will be provided a little later, after I get some feedback.
>
> That's all, so what would you say?
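
For concreteness, here is a minimal in-memory sketch of the chunked strategy from the quoted proposal (steps 1-5). It is not the actual Mahout Job: the class and method names here are hypothetical, the model is plain log-space arrays rather than the real org.apache.mahout.classifier.sequencelearning.hmm.HmmModel, and chunks are processed in a simple loop where a Hadoop implementation would schedule tasks. The point is only to show how each chunk's Viterbi pass is seeded with the optimal-path probabilities from the end of the previous chunk (step 4), and how back-pointers are kept per chunk for the final backtracking.

import java.util.Arrays;

/**
 * Sketch of the chunked Viterbi strategy: split the observation
 * sequence into chunks, run the dynamic-programming pass chunk by
 * chunk, and seed each chunk with the boundary log-probabilities of
 * the previous one. All names are illustrative, not Mahout API.
 */
public class ChunkedViterbiSketch {

  // Model in log space: logA[i][j] = log P(state j | state i),
  // logB[i][k] = log P(obs k | state i), logPi[i] = log P(state i at t=0).
  private final double[][] logA;
  private final double[][] logB;
  private final double[] logPi;
  private final int numStates;

  public ChunkedViterbiSketch(double[][] a, double[][] b, double[] pi) {
    numStates = pi.length;
    logA = logOf(a);
    logB = logOf(b);
    logPi = new double[numStates];
    for (int i = 0; i < numStates; i++) logPi[i] = Math.log(pi[i]);
  }

  private static double[][] logOf(double[][] m) {
    double[][] r = new double[m.length][];
    for (int i = 0; i < m.length; i++) {
      r[i] = new double[m[i].length];
      for (int j = 0; j < m[i].length; j++) r[i][j] = Math.log(m[i][j]);
    }
    return r;
  }

  /**
   * Runs the Viterbi recurrence over one chunk. 'delta' holds the best
   * log-probabilities at the position just before the chunk (or the
   * initial distribution for the first chunk) and is updated in place;
   * back-pointers are returned for the final backtracking pass.
   */
  private int[][] processChunk(int[] chunk, double[] delta, boolean first) {
    int[][] backPointers = new int[chunk.length][numStates];
    for (int t = 0; t < chunk.length; t++) {
      double[] next = new double[numStates];
      for (int j = 0; j < numStates; j++) {   // parallelizable across states
        if (first && t == 0) {
          next[j] = logPi[j] + logB[j][chunk[0]];
          backPointers[0][j] = -1;            // no predecessor at t = 0
          continue;
        }
        double best = Double.NEGATIVE_INFINITY;
        int arg = 0;
        for (int i = 0; i < numStates; i++) {
          double cand = delta[i] + logA[i][j];
          if (cand > best) { best = cand; arg = i; }
        }
        next[j] = best + logB[j][chunk[t]];
        backPointers[t][j] = arg;
      }
      System.arraycopy(next, 0, delta, 0, numStates);
    }
    return backPointers;
  }

  /** Decodes the whole sequence chunk by chunk (steps 1-5). */
  public int[] decode(int[] observations, int chunkSize) {
    double[] delta = new double[numStates];
    int numChunks = (observations.length + chunkSize - 1) / chunkSize;
    int[][][] pointers = new int[numChunks][][];
    for (int c = 0; c < numChunks; c++) {     // each chunk would be one Job
      int from = c * chunkSize;
      int to = Math.min(from + chunkSize, observations.length);
      pointers[c] =
          processChunk(Arrays.copyOfRange(observations, from, to), delta, c == 0);
    }
    // Backtrack from the best final state through all stored back-pointers.
    int[] path = new int[observations.length];
    int state = 0;
    for (int j = 1; j < numStates; j++) if (delta[j] > delta[state]) state = j;
    for (int t = observations.length - 1; t >= 0; t--) {
      path[t] = state;
      int c = t / chunkSize;
      int prev = pointers[c][t - c * chunkSize][state];
      if (prev >= 0) state = prev;
    }
    return path;
  }
}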
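
One thing the sketch makes visible: because each chunk consumes the boundary probabilities of the previous one, the chunks themselves must be processed in order, so on MapReduce the parallelism would likely come from distributing the per-state maximization within a time step and from decoding many independent sequences at once, while chunking mainly bounds the back-pointer state each task has to hold. Chunk size then becomes a trade-off between per-Job overhead and per-task memory, which seems like exactly the kind of question stage 1 of the proposal should settle.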
