On 5 April 2011 17:22, Grant Ingersoll <[email protected]> wrote:
> Hi Sergey,
>
> Sorry for the delay, I've been under a rock for the last few weeks. Please
> open a JIRA issue for your proposal. Overall, I think it sounds
> reasonable, but please do think about timelines a bit more when you submit.
> I'm not totally up on the techniques just yet, so I will need to go do some
> background reading.
>
> I would also think about whether you can narrow down step 1 of your project
> stage. Is this something you can do up front a bit more? I'd hate to see
> that bleed into weeks and weeks of the start of the project.
Yes, I've thought about the first stage a lot and have gathered quite a
few ideas at this point. Where is it better to discuss them - here or
in JIRA?
I'm going to create a JIRA ticket with the completed proposal today or tomorrow.
>
> Cheers,
> Grant
>
> On Mar 28, 2011, at 8:46 AM, Sergey Bartunov wrote:
>
>> Hi! Here is my proposal draft for GSoC. Your comments, questions, etc
>> are welcome.
>>
>> I already wrote to the mailing list about my GSoC idea, which was
>> implementing parallel versions of HMM algorithms, and since Dhruv
>> Kumar has taken the Baum-Welch algorithm
>> (https://issues.apache.org/jira/browse/MAHOUT-627), which is the
>> training algorithm for HMMs, I'd like to implement the Viterbi
>> algorithm (the evaluation one). If our projects succeed, the HMM
>> functionality in Mahout will be ready to work in parallel mode and
>> therefore to handle large amounts of data for both training and
>> evaluation. Since HMMs have many applications such as speech
>> recognition and generation, natural language processing, and
>> bioinformatics, and since Mahout is currently somewhat lacking in
>> graphical models (and as far as I know, distributed/parallel
>> algorithms for graphical models are still not well studied), such a
>> contribution would be helpful for the community.
>>
>> The Viterbi algorithm is a classic dynamic programming algorithm and
>> is well described in many sources such as [1], [2], [3]. Being
>> popular and widely needed, it already has parallel versions that have
>> been the subject of several studies, for example this paper on a GPU
>> implementation [4]. Some of these papers may contain useful ideas for
>> a map-reduce implementation.
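>>
>> For reference, here is a minimal sketch of the sequential dynamic
>> program (in log space to avoid underflow); the method and array names
>> are only illustrative, not existing Mahout API:
>>
>>   // pi: initial, a: transition, b: emission log-probabilities
>>   static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
>>     int n = pi.length, t = obs.length;
>>     double[][] delta = new double[t][n]; // best log-prob ending in state j at step s
>>     int[][] psi = new int[t][n];         // back-pointers for path reconstruction
>>     for (int j = 0; j < n; ++j)
>>       delta[0][j] = pi[j] + b[j][obs[0]];
>>     for (int s = 1; s < t; ++s)
>>       for (int j = 0; j < n; ++j) {
>>         int best = 0;
>>         for (int i = 1; i < n; ++i)
>>           if (delta[s - 1][i] + a[i][j] > delta[s - 1][best] + a[best][j])
>>             best = i;
>>         delta[s][j] = delta[s - 1][best] + a[best][j] + b[j][obs[s]];
>>         psi[s][j] = best;
>>       }
>>     int[] path = new int[t]; // backtrack from the most probable final state
>>     for (int j = 1; j < n; ++j)
>>       if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
>>     for (int s = t - 1; s > 0; --s)
>>       path[s - 1] = psi[s][path[s]];
>>     return path;
>>   }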
>>
>> There are many possible strategies for parallelization (which one is
>> best is an open question for the project); one of them, for example,
>> could be the following (see the sketch after the list):
>> 1) divide the whole input (a sequence of observations) into chunks of
>> some fixed size
>> 2) set the first chunk as active
>> 3) calculate the optimal paths within the active chunk (in parallel)
>> 4) set the next chunk as active and perform step 3 using the optimal
>> path probabilities from the previous step
>> 5) repeat until no unprocessed chunks remain
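>>
>> A rough sketch of the driver for this strategy: only the per-state
>> boundary log-probabilities cross chunk borders, so each chunk's inner
>> dynamic program can run as its own (parallelizable) task. Note that
>> split, forwardChunk, initialColumn, argmax, backtrack and ChunkResult
>> are hypothetical helpers here, not existing code:
>>
>>   double[] boundary = initialColumn(model, obs[0]); // delta values at time 0
>>   List<int[][]> pointers = new ArrayList<int[][]>();
>>   for (int[] chunk : split(obs, chunkSize)) {
>>     ChunkResult r = forwardChunk(model, chunk, boundary);
>>     boundary = r.lastColumn;      // carried into the next chunk (step 4)
>>     pointers.add(r.backPointers); // kept for the final backtracking pass
>>   }
>>   int[] path = backtrack(pointers, argmax(boundary));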
>>
>> Project stages:
>> 1. Prepare a list of all reasonable parallelization strategies,
>> analyze them, and discuss them with the mentor/community.
>> 2. Select the most suitable one and implement it in the Mahout code
>> base as a Job. The implementation should work (or be compatible) with
>> the org.apache.mahout.classifier.sequencelearning.hmm.HmmModel class
>> (a rough wiring sketch follows the list).
>> 3. Collect a debugging dataset, find possible bugs in the code, and
>> fix them.
>> 4. Try to get a large dataset to test performance, scalability, etc.,
>> then write and run such a test (community help might be needed here).
>> If any problems come up, discuss and analyze them, then fix them.
>> 5. Write documentation for all classes, a description of the
>> algorithm on the wiki, and a complete usage example.
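>>
>> To make stage 2 concrete, the Job wiring could look roughly like the
>> sketch below; ViterbiDriver, ViterbiMapper and ViterbiReducer are
>> placeholder names for the classes this stage would add, while the
>> rest is standard Hadoop boilerplate:
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.Job;
>>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>   import org.apache.mahout.math.VectorWritable;
>>
>>   public class ViterbiDriver {
>>     public static void main(String[] args) throws Exception {
>>       Job job = new Job(new Configuration(), "parallel-viterbi");
>>       job.setJarByClass(ViterbiDriver.class);
>>       job.setMapperClass(ViterbiMapper.class);   // to be written in stage 2
>>       job.setReducerClass(ViterbiReducer.class); // to be written in stage 2
>>       job.setOutputKeyClass(Text.class);
>>       job.setOutputValueClass(VectorWritable.class);
>>       FileInputFormat.addInputPath(job, new Path(args[0]));   // observations
>>       FileOutputFormat.setOutputPath(job, new Path(args[1])); // decoded paths
>>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>>     }
>>   }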
>>
>> 1.
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
>> 2. http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
>> 3. http://www.kanungo.com/software/hmmtut.pdf
>> 4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
>>
>> A timeline will be provided a bit later, once I get some feedback.
>>
>> That's all, so what would you say?
>