Hi Sergey,

Sorry for the delay; I've been under a rock for the last few weeks.  Please 
open a JIRA issue for your proposal.  Overall, I think it sounds reasonable, 
but please do think about timelines a bit more when you submit.  I'm not 
totally up on the techniques just yet, so I will need to go do some background 
reading.

I would also think about whether you can narrow down stage 1 of your project 
plan.  Is this something you can do up front a bit more?  I'd hate to see it 
bleed into weeks and weeks at the start of the project.

Cheers,
Grant

On Mar 28, 2011, at 8:46 AM, Sergey Bartunov wrote:

> Hi! Here is my proposal draft for GSoC. Your comments, questions, etc.
> are welcome.
> 
> I already wrote to the mailing list about my GSoC idea, which was
> implementing parallel versions of HMM algorithms. Since Dhruv Kumar
> has taken the Baum-Welch algorithm
> (https://issues.apache.org/jira/browse/MAHOUT-627), which is the
> training algorithm for HMMs, I'd like to implement the Viterbi
> algorithm, which is the decoding one (it finds the most likely hidden
> state sequence). If both projects succeed, the HMM functionality in
> Mahout will be ready to work in parallel mode and therefore to handle
> large amounts of data for both training and decoding. Since HMMs have
> many applications such as speech recognition and generation, natural
> language processing, and bioinformatics, and at the moment Mahout
> offers little in the way of graphical models (as far as I know,
> distributed/parallel algorithms for graphical models are still not
> well studied), such a contribution would be helpful for the community.
> 
> The Viterbi algorithm is a classic dynamic programming algorithm and
> is well described in many sources such as [1], [2], [3]. Being
> popular and widely needed, the Viterbi algorithm already has parallel
> versions that have been the subject of several studies, for example a
> paper about a GPU implementation [4]. Some of these papers may
> contain useful ideas for a map-reduce implementation.
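> 
> For reference, here is a minimal sequential sketch of the recurrence
> in Java, computed in log space. The logPi/logA/logB arrays stand for
> the initial, transition, and emission probabilities; they are
> illustrative placeholders, not the actual HmmModel API:
> 
>   static int[] viterbi(double[] logPi, double[][] logA,
>                        double[][] logB, int[] obs) {
>     int n = logPi.length, t = obs.length;
>     double[][] delta = new double[t][n]; // best log-prob of ending in state j at time i
>     int[][] psi = new int[t][n];         // back-pointers for path recovery
>     for (int j = 0; j < n; j++) {
>       delta[0][j] = logPi[j] + logB[j][obs[0]];
>     }
>     for (int i = 1; i < t; i++) {
>       for (int j = 0; j < n; j++) {
>         delta[i][j] = Double.NEGATIVE_INFINITY;
>         for (int k = 0; k < n; k++) {
>           double p = delta[i - 1][k] + logA[k][j] + logB[j][obs[i]];
>           if (p > delta[i][j]) { delta[i][j] = p; psi[i][j] = k; }
>         }
>       }
>     }
>     int best = 0; // backtrack from the most probable final state
>     for (int j = 1; j < n; j++) {
>       if (delta[t - 1][j] > delta[t - 1][best]) { best = j; }
>     }
>     int[] path = new int[t];
>     path[t - 1] = best;
>     for (int i = t - 1; i > 0; i--) {
>       path[i - 1] = psi[i][path[i]];
>     }
>     return path;
>   }
> 
> The difficulty for map-reduce is that delta at time i depends on time
> i - 1, so the time loop cannot be split naively.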
> 
> There are many possible strategies for parallelization (which one is
> best is an open question for the project); for example, one might be
> the following (see the sketch right after this list):
> 1) divide the whole input (the sequence of observations) into chunks
> of some fixed size
> 2) set the first chunk as the active one
> 3) calculate the optimal paths within the active chunk (in parallel
> mode)
> 4) set the next chunk as active and perform step 3 using the optimal
> path probabilities from the previous step
> 5) repeat until no chunks remain unlabeled
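> 
> To make steps 3 and 4 concrete, here is a rough sketch of the
> per-chunk pass and the hand-off of probabilities between chunks.
> decodeChunk() and everything in it are illustrative, not existing
> Mahout code; it just reuses the recurrence from the sketch above:
> 
>   // Decode one chunk given delta, the best log-probabilities per state
>   // carried over from the previous chunk (step 4); delta is updated in
>   // place, and the chunk's back-pointers are returned for later path
>   // recovery.
>   static int[][] decodeChunk(double[] delta, double[][] logA,
>                              double[][] logB, int[] chunk) {
>     int n = delta.length;
>     int[][] psi = new int[chunk.length][n];
>     for (int i = 0; i < chunk.length; i++) {
>       double[] next = new double[n];
>       java.util.Arrays.fill(next, Double.NEGATIVE_INFINITY);
>       for (int j = 0; j < n; j++) {
>         for (int k = 0; k < n; k++) {
>           double p = delta[k] + logA[k][j] + logB[j][chunk[i]];
>           if (p > next[j]) { next[j] = p; psi[i][j] = k; }
>         }
>       }
>       System.arraycopy(next, 0, delta, 0, n);
>     }
>     return psi;
>   }
> 
> Backtracking then stitches the per-chunk back-pointers together,
> starting from the best state in the last chunk's delta. Note that the
> chunks themselves are processed in order; the parallelism is inside a
> chunk (over states) and across many independent input sequences.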
> 
> Project stages:
> 1. Prepare a list of all reasonable parallelization strategies,
> analyze them, and discuss them with the mentor/community.
> 2. Select the most suitable one and implement it in Mahout as a Job
> (a minimal skeleton of what that could look like follows this list).
> The implementation should work with (or at least be compatible with)
> the org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class.
> 3. Collect a debug dataset, find possible bugs in the code, and fix
> them.
> 4. Try to get a large dataset to test performance, scalability, etc.,
> then write and run such a test (community help might be needed here).
> If any problems appear, discuss and analyze them, then fix them.
> 5. Write documentation for all classes, a description of the
> algorithm on the wiki, and a complete usage example.
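> 
> For stage 2, a very rough skeleton of such a Job could look like the
> following. Only the Hadoop classes are real; ViterbiDriver,
> ViterbiMapper, ViterbiReducer, and the key/value types are
> placeholders to be designed in stages 1 and 2:
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.Reducer;
>   import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
> 
>   public class ViterbiDriver {
> 
>     // Placeholder mapper/reducer: the real ones would run the per-chunk
>     // Viterbi pass and stitch the paths together; types are illustrative.
>     public static class ViterbiMapper
>         extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {}
> 
>     public static class ViterbiReducer
>         extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {}
> 
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       // The serialized HmmModel would be passed to the tasks via the
>       // configuration or the distributed cache.
>       Job job = new Job(conf, "viterbi-decoding");
>       job.setJarByClass(ViterbiDriver.class);
>       job.setMapperClass(ViterbiMapper.class);
>       job.setReducerClass(ViterbiReducer.class);
>       job.setInputFormatClass(SequenceFileInputFormat.class);
>       job.setOutputFormatClass(SequenceFileOutputFormat.class);
>       job.setOutputKeyClass(IntWritable.class);   // e.g. position in the sequence
>       job.setOutputValueClass(IntWritable.class); // e.g. decoded hidden state
>       SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
>       SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
>   }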
> 
> [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [2] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [3] http://www.kanungo.com/software/hmmtut.pdf
> [4] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> 
> A timeline will be provided a little bit later, after I get some feedback.
> 
> That's all. What do you say?


