Proposal for Implementing Hidden Markov Model
---------------------------------------------
Key: MAHOUT-396
URL: https://issues.apache.org/jira/browse/MAHOUT-396
Project: Mahout
Issue Type: New Feature
Reporter: Max Heimel
Priority: Minor
h4. Overview
This is a project proposal for a summer-term university project to write a
(sequential) HMM implementation for Mahout. Five students will work on this
project as part of a course mentored by Isabel Drost.
h4. Abstract:
Hidden Markov Models are used in multiple areas of Machine Learning, such as
speech recognition, handwritten letter recognition or natural language
processing. A Hidden Markov Model (HMM) is a statistical model of a process
consisting of two (in our case discrete) random variables O and Y, which change
their state sequentially. The variable Y with states {y_1, ... , y_n} is called
the "hidden variable", since its state is not directly observable. The state of
Y changes sequentially with a so called - in our case first-order - Markov
Property. This means, that the state change probability of Y only depends on
its current state and does not change in time. Formally we write:
P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). The variable
O with states {o_1, ... , o_m} is called the "observable variable", since its
state can be directly observed. O does not have a Markov Property, but its
state propability depends statically on the current state of Y.
Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of
hidden states, m is the number of observable states, P is an n-dimensional
vector containing initial hidden state probabilities, A is the nxn-dimensional
"transition matrix" containing the transition probabilities such that
A[i,j]=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "observation matrix"
containing the observation probabilties such that B[i,j]= P(O=o_i|Y=y_j).
Rabiner [[1|My Page#reference1]] defined three main problems for HMM models:
# Evaluation: Given a sequence O of observations and a model M, what is the
probability P(O|M) that sequence O was generated by model M. The Evaluation
problem can be efficiently solved using the Forward algorithm
# Decoding: Given a sequence O of observations and a model M, what is the most
likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to generate this
sequence. The Decoding problem can be efficiently sovled using the Viterbi
algorithm.
# Learning: Given a sequence O of observations, what is the most likely model
M*=argmax(M)P(O|M) to generate this sequence. The Learning problem can be
efficiently solved using the Baum-Welch algorithm.
The target of each milestone is defined as the implementation for the given
goals, the respective documentation and unit tests for the implementation.
h4.Timeline
Mid of May 2010 - Mid of July 2010
h4.Milestones
I) Define an HMM class based on Apache Mahout Math package offering interfaces
to set model parameters, perform consistency checks, perform output prediction.
1 week from May 18th till May 25th.
II) Write sequential implementations of forward (cf. problem 1 [[1|My
Page#reference1]]) and backward algorithm.
2 weeks from May 25th till June 8th.
III) Write a sequential implementation of Viterbi algorithm (cf. problem 2
[[1|My Page#reference1]]), based on existing forward algorithm implementation.
2 weeks from June 8th till June 22nd
IV) Have a running sequential implementation of Baum-Welch algorithm for model
parameter learning (application II [ref]), based on existing forward/backward
algorithm implementation.
2 weeks from June 8th till June 22nd
V) Provide a usage example of HMM implementation, demonstrating all three
problems.
2 weeks from June 22nd till July 6th
VI) Finalize documentation and implemenation, clean up open ends.
1 week from July 6th till July 13th
h4.References:
{anchor:reference1}[[1|http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf]]
Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and
selected applications in speech recognition". Proceedings of the IEEE 77 (2):
257-286. doi:10.1109/5.18626.
Potential test data sets:
[http://www.cnts.ua.ac.be/conll2000/chunking/]
[http://archive.ics.uci.edu/ml/datasets/Character+Trajectories]
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.