> Yes, I've thought about and explored a lot for the first stage, and at
> the moment I have many ideas. Where is it better to discuss them - here
> or in Jira?
imo here. Jira should probably contain just the summary of what you
ultimately decided to do. -d

> I'm going to create a Jira ticket with the completed proposal today or
> tomorrow.
>>
>> Cheers,
>> Grant
>>
>> On Mar 28, 2011, at 8:46 AM, Sergey Bartunov wrote:
>>
>>> Hi! Here is my proposal draft for GSoC. Your comments, questions, etc.
>>> are welcome.
>>>
>>> I already wrote to the mailing list about my GSoC idea, which is
>>> implementing parallel versions of the HMM algorithms, and since Dhruv
>>> Kumar has taken the Baum-Welch algorithm
>>> (https://issues.apache.org/jira/browse/MAHOUT-627), which is the
>>> training algorithm for HMMs, I'd like to implement the Viterbi
>>> algorithm (the evaluation one). If our projects succeed, the HMM
>>> functionality in Mahout will be ready to work in parallel mode and
>>> therefore able to handle large amounts of data for both training and
>>> evaluation. Since HMMs have many applications such as speech
>>> recognition and generation, natural language processing,
>>> bioinformatics, etc., and since at the moment Mahout offers little for
>>> graphical models (and, as far as I know, distributed/parallel
>>> algorithms for graphical models are still not well studied), such a
>>> contribution would be helpful for the community.
>>>
>>> The Viterbi algorithm is a fairly common dynamic programming
>>> algorithm; it is well described in many sources such as [1], [2], [3].
>>> Being popular and widely needed, the Viterbi algorithm already has
>>> parallel versions, which have been the subject of several studies, for
>>> example this paper about a GPU implementation [4]. Some of these
>>> papers may contain useful ideas for a map-reduce implementation.
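For reference, the standard sequential Viterbi recurrence being discussed could be sketched as follows. This is a generic textbook formulation with illustrative names, independent of Mahout's `HmmModel` API; all probabilities are in log space to avoid underflow.

```java
// A minimal, self-contained sketch of the (sequential) Viterbi algorithm.
// All names are illustrative, not Mahout's actual API.
public class ViterbiSketch {
    // Returns the most likely hidden-state sequence for the observations.
    // initial[s]: log P(state s at t=0); trans[p][s]: log P(s | p);
    // emit[s][o]: log P(observation o | state s).
    static int[] viterbi(int[] obs, double[] initial,
                         double[][] trans, double[][] emit) {
        int n = obs.length, states = initial.length;
        double[][] delta = new double[n][states]; // best log-prob ending in each state
        int[][] back = new int[n][states];        // backpointers for path recovery
        for (int s = 0; s < states; s++) {
            delta[0][s] = initial[s] + emit[s][obs[0]];
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < states; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < states; p++) {
                    double cand = delta[t - 1][p] + trans[p][s];
                    if (cand > best) { best = cand; arg = p; }
                }
                delta[t][s] = best + emit[s][obs[t]];
                back[t][s] = arg;
            }
        }
        // Recover the optimal path by following backpointers
        // from the best final state.
        int[] path = new int[n];
        int last = 0;
        for (int s = 1; s < states; s++) {
            if (delta[n - 1][s] > delta[n - 1][last]) last = s;
        }
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

The O(n * states^2) double loop over states is the part the project aims to distribute.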
>>>
>>> There are many possible strategies for parallelization (which one is
>>> better is an open question for the project); for example, one may be
>>> the following:
>>> 1) divide the whole input (the sequence of observations) into
>>> chunks of some size
>>> 2) set the active chunk to 1
>>> 3) calculate optimal paths in the active chunk (in parallel mode)
>>> 4) set the next chunk as active and perform step 3 using the optimal
>>> path probabilities from the previous step
>>> 5) repeat until there are no unlabeled chunks
>>>
>>> Project stages:
>>> 1. Prepare a list of all reasonable parallelization strategies,
>>> analyze them, and discuss them with the mentor/community.
>>> 2. Select the most suitable one and implement it in the Mahout code as
>>> a Job. The implementation should work (or be compatible) with the
>>> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class.
>>> 3. Collect a debug dataset, find possible bugs in the code, and fix
>>> them.
>>> 4. Try to get a large dataset to test performance, scalability, etc.;
>>> write and perform such a test (community help might be needed here).
>>> If there are any problems, discuss and analyze them, then fix them.
>>> 5. Write documentation for all classes, a description of the algorithm
>>> on the wiki, and a complete usage example.
>>>
>>> 1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
>>> 2. http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
>>> 3. http://www.kanungo.com/software/hmmtut.pdf
>>> 4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>>>
>>> A timeline will be provided a little later, after I get some feedback.
>>>
>>> That's all - so what would you say?
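The chunked strategy in steps 1-5 above could be sketched as follows. This is an illustrative single-process sketch, not Mahout code and not a map-reduce Job: each chunk's forward pass starts from the per-state log-probabilities produced by the previous chunk (step 4), and the per-state loop inside a chunk is the part that can run in parallel (step 3). All class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the chunked parallelization strategy described
// above; names and structure are illustrative, not Mahout's actual API.
public class ChunkedViterbiSketch {
    // Runs the Viterbi forward pass over one chunk, starting from the
    // 'incoming' per-state log-probabilities of the previous chunk; fills
    // 'back' with backpointers and returns the per-state log-probabilities
    // after the chunk's last observation.
    static double[] forwardChunk(int[] chunk, double[] incoming,
                                 double[][] trans, double[][] emit,
                                 int[][] back) {
        int states = incoming.length;
        double[] delta = incoming.clone();
        for (int t = 0; t < chunk.length; t++) {
            double[] next = new double[states];
            // Each target state is independent here, so this loop is the
            // natural unit of parallelism within a chunk (step 3).
            for (int s = 0; s < states; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < states; p++) {
                    double cand = delta[p] + trans[p][s];
                    if (cand > best) { best = cand; arg = p; }
                }
                next[s] = best + emit[s][chunk[t]];
                back[t][s] = arg;
            }
            delta = next;
        }
        return delta;
    }

    // Steps 1-5: split the input into chunks and advance chunk by chunk,
    // threading the per-state log-probabilities across chunk boundaries.
    static double[] run(int[] obs, int chunkSize, double[] initial,
                        double[][] trans, double[][] emit) {
        int states = initial.length;
        double[] delta = new double[states];
        for (int s = 0; s < states; s++) {
            delta[s] = initial[s] + emit[s][obs[0]]; // first observation: no transition yet
        }
        for (int start = 1; start < obs.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, obs.length);
            int[] chunk = Arrays.copyOfRange(obs, start, end);
            // In a map-reduce job the backpointers would be persisted per
            // chunk so the optimal path can be recovered in a final
            // backward sweep over all chunks.
            int[][] back = new int[chunk.length][states];
            delta = forwardChunk(chunk, delta, trans, emit, back);
        }
        return delta; // per-state log-probabilities after the last chunk
    }
}
```

Because only the per-state probability vector crosses a chunk boundary, the chunk size changes how much work each round does but not the result, which is what makes the boundary a natural place to cut a distributed job.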
