> Yes, I've thought about and explored a lot for the first stage, and at
> the moment I have a lot of ideas. Where is it better to discuss them -
> here or in Jira?

imo here. Jira should probably contain just the summary of what you
ultimately decided to do.

-d



> I'm going to create a Jira ticket with the completed proposal today or tomorrow.
>>
>> Cheers,
>> Grant
>>
>> On Mar 28, 2011, at 8:46 AM, Sergey Bartunov wrote:
>>
>>> Hi! Here is my proposal draft for GSoC. Your comments, questions,
>>> etc. are welcome.
>>>
>>> I already wrote to the mailing list about my GSoC idea, which was
>>> implementing parallel versions of HMM algorithms. Since Dhruv
>>> Kumar has taken the Baum-Welch algorithm
>>> (https://issues.apache.org/jira/browse/MAHOUT-627), which is the
>>> training algorithm for HMMs, I'd like to implement the Viterbi
>>> algorithm (the evaluation/decoding one). If both our projects
>>> succeed, the HMM functionality in Mahout will be ready to work in
>>> parallel mode and therefore to handle large amounts of data both for
>>> training and evaluation. Since HMMs have many applications such as
>>> speech recognition and generation, natural language processing,
>>> bioinformatics, etc., and at the moment Mahout offers little for
>>> graphical models (and as far as I know, distributed/parallel
>>> algorithms for graphical models are still not well studied), such a
>>> contribution will be helpful for the community.
>>>
>>> The Viterbi algorithm is a fairly common dynamic programming
>>> algorithm; it is well described in many sources such as [1], [2],
>>> [3]. Being popular and widely needed, the Viterbi algorithm already
>>> has parallel versions that have been the subject of several studies,
>>> for example this paper on a GPU implementation [4]. Some of these
>>> papers may contain useful ideas for a map-reduce implementation.
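>>>
>>> To make the sequential baseline concrete, here is a minimal Viterbi
>>> sketch in plain Java (log-space probabilities to avoid underflow;
>>> the raw arrays are stand-ins of my own, not Mahout's actual model
>>> classes):
>>>
>>> static int[] viterbi(double[] logInitial,       // log pi[s]
>>>                      double[][] logTransition,  // log a[s][s']
>>>                      double[][] logEmission,    // log b[s][o]
>>>                      int[] observations) {
>>>   int numStates = logInitial.length;
>>>   int len = observations.length;
>>>   double[][] delta = new double[len][numStates]; // best log-prob ending in s at t
>>>   int[][] psi = new int[len][numStates];         // backpointers
>>>   for (int s = 0; s < numStates; s++) {
>>>     delta[0][s] = logInitial[s] + logEmission[s][observations[0]];
>>>   }
>>>   for (int t = 1; t < len; t++) {
>>>     for (int s = 0; s < numStates; s++) {
>>>       int best = 0;
>>>       double bestScore = delta[t - 1][0] + logTransition[0][s];
>>>       for (int prev = 1; prev < numStates; prev++) {
>>>         double score = delta[t - 1][prev] + logTransition[prev][s];
>>>         if (score > bestScore) { bestScore = score; best = prev; }
>>>       }
>>>       delta[t][s] = bestScore + logEmission[s][observations[t]];
>>>       psi[t][s] = best;
>>>     }
>>>   }
>>>   int last = 0; // backtrack from the best final state
>>>   for (int s = 1; s < numStates; s++) {
>>>     if (delta[len - 1][s] > delta[len - 1][last]) { last = s; }
>>>   }
>>>   int[] path = new int[len];
>>>   path[len - 1] = last;
>>>   for (int t = len - 1; t > 0; t--) {
>>>     path[t - 1] = psi[t][path[t]];
>>>   }
>>>   return path;
>>> }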
>>>
>>> There are many possible parallelization strategies (which one is
>>> best is an open question for the project); for example, one may be
>>> the following (see the sketch after the list):
>>> 1) divide the whole input (a sequence of observations) into chunks
>>> of some fixed size
>>> 2) set the first chunk as active
>>> 3) calculate the optimal paths in the active chunk (in parallel mode)
>>> 4) set the next chunk as active and perform step 3 using the optimal
>>> path probabilities from the previous step
>>> 5) repeat until all chunks have been processed
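>>>
>>> A rough sketch of steps 3-4 under this scheme is below. The names
>>> (ChunkResult, processChunk) are mine, purely for illustration, and
>>> the final backtracking pass that stitches the chunk-local
>>> backpointers into one full path is omitted. Note the chunks
>>> themselves must still be visited in order - the parallelism comes
>>> from running step 3 over many independent sequences (or over the
>>> states within a chunk) at once:
>>>
>>> static class ChunkResult {
>>>   double[] boundaryDelta; // per-state best log-probs at the chunk's end
>>>   int[][] psi;            // backpointers local to this chunk
>>> }
>>>
>>> // Runs the Viterbi recursion over one chunk, seeded with the previous
>>> // chunk's boundary log-probabilities. The very first chunk is seeded
>>> // differently: delta[s] = logInitial[s] + logEmission[s][firstObs],
>>> // and its loop starts from the second observation.
>>> static ChunkResult processChunk(double[] incomingDelta,
>>>                                 double[][] logTransition,
>>>                                 double[][] logEmission,
>>>                                 int[] chunkObservations) {
>>>   int numStates = incomingDelta.length;
>>>   double[] delta = incomingDelta.clone();
>>>   int[][] psi = new int[chunkObservations.length][numStates];
>>>   for (int t = 0; t < chunkObservations.length; t++) {
>>>     double[] next = new double[numStates];
>>>     for (int s = 0; s < numStates; s++) {
>>>       int best = 0;
>>>       double bestScore = delta[0] + logTransition[0][s];
>>>       for (int prev = 1; prev < numStates; prev++) {
>>>         double score = delta[prev] + logTransition[prev][s];
>>>         if (score > bestScore) { bestScore = score; best = prev; }
>>>       }
>>>       next[s] = bestScore + logEmission[s][chunkObservations[t]];
>>>       psi[t][s] = best;
>>>     }
>>>     delta = next;
>>>   }
>>>   ChunkResult result = new ChunkResult();
>>>   result.boundaryDelta = delta;
>>>   result.psi = psi;
>>>   return result;
>>> }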
>>>
>>> Project stages:
>>> 1. Prepare a list of all reasonable parallelization strategies,
>>> analyze them, and discuss them with the mentor/community.
>>> 2. Select the most suitable one and implement it in the Mahout code
>>> base as a Job. The implementation should work with (or at least be
>>> compatible with) the existing
>>> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class
>>> (see the baseline snippet after this list).
>>> 3. Collect a debugging dataset, find possible bugs in the code, and
>>> fix them.
>>> 4. Obtain a large dataset to test performance, scalability, etc.,
>>> and write and run such a test (community help might be needed here).
>>> If any problems appear, analyze and discuss them, then fix them.
>>> 5. Write documentation for all classes, a description of the
>>> algorithm on the wiki, and a complete usage example.
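>>>
>>> For stage 2, the sequential entry point the Job should agree with
>>> is, if I read the current code correctly, HmmEvaluator.decode; a
>>> tiny baseline like the following (toy numbers of my own) could
>>> double as a correctness check for the parallel version:
>>>
>>> import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
>>> import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;
>>>
>>> public class ViterbiBaseline {
>>>   public static void main(String[] args) {
>>>     // Randomly initialized model: 4 hidden states, 3 output symbols.
>>>     HmmModel model = new HmmModel(4, 3);
>>>     // Toy observation sequence; the parallel Job should return the
>>>     // same state sequence as this sequential call on any input.
>>>     int[] observations = {0, 1, 2, 2, 0, 1};
>>>     int[] states = HmmEvaluator.decode(model, observations, true);
>>>     System.out.println(java.util.Arrays.toString(states));
>>>   }
>>> }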
>>>
>>> 1. 
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
>>> 2. http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
>>> 3. http://www.kanungo.com/software/hmmtut.pdf
>>> 4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>>>
>>> A timeline will be provided a bit later, after I get some feedback.
>>>
>>> That's all, so what would you say?
