[
https://issues.apache.org/jira/browse/MAHOUT-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891416#action_12891416
]
Max Heimel commented on MAHOUT-396:
-----------------------------------
The latest patch contains the finished implementation and test suite of
sequential Hidden Markov Models for Apache Mahout and should be ready for
review.
The implementation now offers all the required functionality (initialization,
evaluation, training, loading/storing) and should work fairly efficiently. All
algorithms can optionally work with log-likelihoods, trading additional
computation time for increased numerical stability in case of long output
sequences. Models can be serialized to and deserialized from JSON. Three
training algorithms are available, covering different usage scenarios
(supervised, Viterbi, Baum-Welch).
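To illustrate the numerical-stability argument behind the optional
log-likelihood mode (a self-contained sketch, not the patch's API): the
per-step output probabilities of a long sequence multiply out to values below
the smallest representable double, whereas their logarithms simply add up.
{code:java}
// Minimal sketch: why log space helps for long output sequences. Multiplying
// many small per-step probabilities underflows to 0.0; summing their logs does not.
public class LogSpaceSketch {
  public static void main(String[] args) {
    double stepProbability = 0.05; // an assumed, typical per-observation probability
    int sequenceLength = 500;      // a "long" output sequence

    double product = 1.0;
    double logSum = 0.0;
    for (int t = 0; t < sequenceLength; t++) {
      product *= stepProbability;          // underflows to 0.0 long before t = 500
      logSum += Math.log(stepProbability); // stays representable (about -1497.9)
    }

    System.out.println("direct product: " + product); // prints 0.0
    System.out.println("log-likelihood: " + logSum);
  }
}
{code}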
I included a small demo program
(org.apache.mahout.classifier.sequencelearning.hmm.example.PosTagger.java) that
demonstrates usage of the model class by implementing a simple
part-of-speech tagger using the training/test data sets from
http://flexcrfs.sourceforge.net/#Case_Study. Using simple supervised learning
on the training data set and assuming unknown words to be of type PNP (proper
noun present), the tagger reaches a test accuracy of 94%. The program output
with timings on my Athlon 2.6 GHz dual core with 4 GB of RAM running Ubuntu
10.04 is:
"Using URL: http://www.jaist.ac.jp/~hieuxuan/flexcrfs/CoNLL2000-NP/train.txt
Reading and parsing training data file ... done in 22.479 seconds!
Read 211727 lines containing 8936 sentences with a total of 19122
distinct words and 43 distinct POS tags.
Training HMM model ... done in 0.719 seconds!
Reading and parsing test data file ... done in 3.189 seconds!
Read 47377 lines containing 2012 sentences.
POS tagging test data file ... done in 0.475 seconds!
Tagged the test file with an error rate of: 0.060176879076345065"
There are still some open ends that need to be looked into. The major points
are:
1) Redesigning HmmAlgorithms to handle SparseMatrix/SparseVector more
efficiently (via nonZeroIterator).
2) Serializing and deserializing a model is currently possible using JSON.
However, for large models this becomes highly inefficient. E.g. serializing the
model created in the PosTagging example results in an 18 MB JSON file that
takes much(!) longer to deserialize than reconstructing the model from the
training data does. It would thus probably be a good idea to look into binary
serialization/deserialization.
3) Convergence check for Viterbi/Baum-Welch: currently the convergence check
uses an "ad-hoc" matrix difference. Since Baum-Welch is an EM algorithm, it
would probably be mathematically more sound to base the convergence check on
the model likelihood (see the sketch after this list).
4) Parallelization: The traditional algorithms
(forward, backward, Viterbi, Baum-Welch) are probably hard to parallelize using
M/R - there is some prior work regarding parallelization, but I haven't quite
looked into this yet. A more suitable approach to HMM parallelization would be
to train in parallel on multiple sequences (e.g. one per mapper) and then merge
the resulting HMMs in the reducer step.
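Regarding point 3, a hedged sketch of what a likelihood-based convergence check
could look like; BaumWelchStep and reestimateAndScore are hypothetical
stand-ins, not names from the patch, and the returned score would be log P(O|M),
e.g. computed with the forward algorithm in log space:
{code:java}
// Sketch only: drive Baum-Welch with a log-likelihood based stopping criterion
// instead of an ad-hoc matrix difference.
public final class LikelihoodConvergenceSketch {

  /** Hypothetical hook: one re-estimation pass that updates the model in place. */
  public interface BaumWelchStep {
    double reestimateAndScore(); // returns the new log-likelihood log P(O|M)
  }

  public static void trainUntilConverged(BaumWelchStep step,
                                         double epsilon,
                                         int maxIterations) {
    double previous = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < maxIterations; i++) {
      double current = step.reestimateAndScore();
      // EM never decreases the likelihood (up to numerical noise), so a small
      // absolute improvement is a natural stopping criterion; the iteration cap
      // guards against slow convergence towards a local optimum.
      if (current - previous < epsilon) {
        break;
      }
      previous = current;
    }
  }
}
{code}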
> Proposal for Implementing Hidden Markov Model
> ---------------------------------------------
>
> Key: MAHOUT-396
> URL: https://issues.apache.org/jira/browse/MAHOUT-396
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.4
> Reporter: Max Heimel
> Assignee: Max Heimel
> Priority: Minor
> Fix For: 0.4
>
> Attachments: MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff,
> MAHOUT-396.diff, MAHOUT-396.diff
>
>
> h4. Overview
> This is a project proposal for a summer-term university project to write a
> (sequential) HMM implementation for Mahout. Five students will work on this
> project as part of a course mentored by Isabel Drost.
> h4. Abstract:
> Hidden Markov Models are used in multiple areas of Machine Learning, such as
> speech recognition, handwritten letter recognition or natural language
> processing. A Hidden Markov Model (HMM) is a statistical model of a process
> consisting of two (in our case discrete) random variables O and Y, which
> change their state sequentially. The variable Y with states {y_1, ... , y_n}
> is called the "hidden variable", since its state is not directly observable.
> The state of Y changes sequentially according to a so-called - in our case
> first-order - Markov property. This means that the state-change probability
> of Y depends only on its current state and does not change over time. Formally
> we write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)).
> The variable O with states {o_1, ... , o_m} is called the "observable
> variable", since its state can be directly observed. O does not have a Markov
> property, but its state probability depends only on the current state of Y.
> Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number
> of hidden states, m is the number of observable states, P is an n-dimensional
> vector containing initial hidden state probabilities, A is the
> nxn-dimensional "transition matrix" containing the transition probabilities
> such that A[i,j]=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional
> "observation matrix" containing the observation probabilties such that
> B[i,j]= P(O=o_i|Y=y_j).
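> As a purely illustrative example (not part of the proposed API), a tiny
> two-state coin model with n=2 hidden states and m=2 observable states can be
> written down directly in these conventions; note that the columns of A and B
> (the entries conditioned on one hidden state) each sum to 1:
> {code:java}
> // Illustrative only: M = (n, m, P, A, B) for a tiny "fair vs. loaded coin" model.
> int n = 2;               // hidden states: 0 = fair, 1 = loaded
> int m = 2;               // observable states: 0 = heads, 1 = tails
> double[] P = {0.8, 0.2}; // initial hidden state probabilities
> double[][] A = {         // A[i][j] = P(Y(t)=y_i | Y(t-1)=y_j)
>     {0.9, 0.3},
>     {0.1, 0.7}
> };
> double[][] B = {         // B[i][j] = P(O=o_i | Y=y_j)
>     {0.5, 0.9},          // heads
>     {0.5, 0.1}           // tails
> };
> {code}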
> Rabiner [1] defined three main problems for HMM models:
> # Evaluation: Given a sequence O of observations and a model M, what is the
> probability P(O|M) that sequence O was generated by model M? The Evaluation
> problem can be efficiently solved using the Forward algorithm.
> # Decoding: Given a sequence O of observations and a model M, what is the
> most likely sequence Y*=argmax(Y) P(Y|O,M) of hidden variables to have
> generated this sequence? The Decoding problem can be efficiently solved using
> the Viterbi algorithm.
> # Learning: Given a sequence O of observations, what is the most likely model
> M*=argmax(M) P(O|M) to have generated this sequence? The Learning problem can
> be efficiently solved using the Baum-Welch algorithm.
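> As an illustrative sketch of problem 1 (Evaluation) - not the proposed Mahout
> API, which will work on Mahout Math vectors and matrices - the forward
> algorithm can be written with plain arrays and the index conventions above:
> {code:java}
> // Returns P(O|M) for the observation sequence obs, given P, A and B as above.
> static double forwardProbability(double[] P, double[][] A, double[][] B, int[] obs) {
>   int n = P.length;               // number of hidden states
>   double[] alpha = new double[n]; // alpha[i] = P(O(0..t), Y(t)=y_i | M)
>   for (int i = 0; i < n; i++) {
>     alpha[i] = P[i] * B[obs[0]][i];
>   }
>   for (int t = 1; t < obs.length; t++) {
>     double[] next = new double[n];
>     for (int i = 0; i < n; i++) {
>       double sum = 0.0;
>       for (int j = 0; j < n; j++) {
>         sum += A[i][j] * alpha[j]; // transition y_j -> y_i
>       }
>       next[i] = sum * B[obs[t]][i]; // emit observation obs[t] from state y_i
>     }
>     alpha = next;
>   }
>   double result = 0.0;            // P(O|M) = sum over the final hidden states
>   for (int i = 0; i < n; i++) {
>     result += alpha[i];
>   }
>   return result;
> }
> {code}
> For long sequences this direct product form underflows, which is what
> motivates log-space variants in practice.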
> Each milestone targets the implementation of the given goals, together with
> the respective documentation and unit tests.
> h4. Timeline
> Mid of May 2010 - Mid of July 2010
> h4. Milestones
> I) Define an HMM class based on the Apache Mahout Math package, offering
> interfaces to set model parameters, perform consistency checks, and perform
> output prediction.
> 1 week from May 18th till May 25th.
> II) Write sequential implementations of the forward (cf. problem 1 [1]) and
> backward algorithms.
> 2 weeks from May 25th till June 8th.
> III) Write a sequential implementation of the Viterbi algorithm (cf. problem
> 2 [1]), based on the existing forward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> IV) Have a running sequential implementation of the Baum-Welch algorithm for
> model parameter learning (application II [ref]), based on the existing
> forward/backward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> V) Provide a usage example of HMM implementation, demonstrating all three
> problems.
> 2 weeks from June 22nd till July 6th
> VI) Finalize documentation and implementation, clean up open ends.
> 1 week from July 6th till July 13th
> h4. References
> {anchor:reference1} [1] Lawrence R. Rabiner (February 1989). "A tutorial on
> Hidden Markov Models and selected applications in speech recognition".
> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf]
> Potential test data sets:
> [http://www.cnts.ua.ac.be/conll2000/chunking/]
> [http://archive.ics.uci.edu/ml/datasets/Character+Trajectories]
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.