[ 
https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17716:
---------------------------------
    Labels: bulk-closed  (was: )

> Hidden Markov Model (HMM)
> -------------------------
>
>                 Key: SPARK-17716
>                 URL: https://issues.apache.org/jira/browse/SPARK-17716
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Runxin Li
>            Priority: Major
>              Labels: bulk-closed
>
> Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at 
> https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him 
> to list popular HMM applications, describe public API (params, input/output 
> schemas), compare its API with existing HMM implementations.
> h1. Hidden Markov Model (HMM) Design Doc
> h2. Overview
> h3. Introduction to HMM
> Hidden Markov Model is a type of statistical Machine Learning model that 
> assumes a sequence of observations is generated by a Markov process with 
> hidden states. There are 3 (or 2, depending on the implementation) main 
> components of the model:
> * *Transition Probability*: describes the probability distribution of 
> transitions from each state to other states (including self) in the Markov 
> process
> * *Emission Probability*: describes the probability distribution for an 
> observation associated with hidden states
> * *Initial/Start Probability* (optional): represents the prior probability of 
> each state at the beginning of the observation sequence
> _Note: some implementations merge the Initial Probability into Transition 
> Probability by adding an arbitrary Start state before the first observation 
> point._
> h3. HMM Models and Algorithms
> Given a limited number of states, most HMM models have the same form of 
> Transition Probability: a matrix, where each element _(i, j)_ represents the 
> probability of transition from state _i_ to state _j_. The Initial 
> Probability usually take the simple form of a probabilistic vector.
> The Emission Probability, on the other hand, can be represented in many 
> different ways, depending on different nature of observations, i.e. 
> continuous vs. discrete, or different model assumptions, e.g. single Gaussian 
> vs. Gaussian Mixtures.
> There are three main problems associated with HMM models, and their canonical 
> algorithms:
> # *Evaluation*: What’s the probability of a given observation sequence, based 
> on the model? It’s usually done by either *Forward* or *Backward* algorithms
> # *Decoding*: What’s the most likely state sequence, given the observation 
> sequence and the model? It’s usually done by *Viterbi* decoding
> # *Learning*: How to train the parameters of the model based on the 
> observation sequences? *Baum-Welch* (Forward-Backward) is usually used as 
> part of the *EM* algorithm in unsupervised training
> h3. Popular Applications of HMM
> * Speech Recognition
> * Part-of-speech Tagging
> * Named Entity Recognition
> * Machine Translation
> * Gene Prediction
> h2. Alternate Libraries
> [Mahout|https://mahout.apache.org/users/classification/hidden-markov-models.html]
> * Assume emission probability to be an m-by-n matrix
> * Use Baum-Welch algorithm for training and Viterbi algorithm for prediction
> * API (command line)
> ** training
> {{$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 
> 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 
> 2 2 3 2 1 1 0" > hmm-input}}
> {{$ export MAHOUT_LOCAL=true}}
> {{$ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 
> -e .0001 -m 1000}}
> ** prediction
> {{$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10}}
> {{$ cat hmm-predictions}}
> [Mallet|http://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
> * Treat HMM as a Finite State Transducer (FST)
> * Theoretically can go beyond first-order Markov assumption by setting an 
> arbitrary order
> * Limited to text data, i.e. discrete observation sequence with Multinomial 
> emission model assumption
> * Supervised training only
> * API:
> ** Training:
> {{HMM hmm = new HMM(pipe, null);}}
> {{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
> {{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);}}
> {{trainer.train(trainingInstances, 10);}}
> ** Testing:
> {{evaluator.evaluate(trainer);}}
> [HMMLearn|https://github.com/hmmlearn/hmmlearn]
> * Previously part of scikit-learn
> * Algo:
> ** Standard HMM unsupervised training algorithm
> ** Three types of emission models: GMM, Gaussian and Multinomial
> * API:
> ** Training: 
> {{model = hmm.GaussianHMM(n_components=3, covariance_type="full")}}
> {{model.fit(X)}}
> ** Testing: 
> {{hidden_states = model.predict(X)}}
> h2. API
> Design Goals
> * Build the foundation for general Sequential Tagging models (HMM, CRF, etc.)
> * Support multiple Emission Probability models such as “Multinomial” and 
> “Gaussian Mixture”
> * Keep both supervised and unsupervised learning for HMM in mind
> h3. Proposed API
> _Note: This is written for the spark.ml API._
> Decoder API
> {code:title=Decoder.scala}
> trait DecoderParams extends Params {
>   def featureCol: DataType // column of sequential features, e.g. MatrixUDT
>   def predictionCol: DoubleType // column of prediction
>   def labelCol: DataType // column of sequential labels, e.g. VectorUDT
> }
> abstract class Decoder extends Estimator with DecoderParams {
>   def extractLabeledSequences(dataset: Dataset[_]): RDD[LabeledSequence]
> }
> abstract class DecodingModel extends Model with DecoderParams {
>   def numFeatures: Int
>   def decode(features: FeatureType): Vector
> }
> {code}
> Tagger API
> {code:title=Tagger.scala}
> trait TaggerParams extends DecoderParams with HasRawPredictionCol {
>   def rawPredictionCol: MatrixUDT // column for all predicted label sequences
> }
> abstract class Tagger extends Decoder with TaggerParams
> abstract class TaggingModel extends DecodingModel with TaggerParams {
>   def numClasses: Int
>   def decodeRaw(features: FeaturesType): Array[(Double, Vector)]
>   def raw2prediction(rawPrediction: Array[(Double, Vector)]): Vector
> }
> {code}
> MarginalTagger (Probabilistic Tagger) API
> {code:title=MarginalTagger.scala}
> trait MarginalTaggerParams extends TaggerParams with HasProbabilityCol with 
> HasThreshold {
>   def probabilityCol: DoubleType // column for probability
> }
> abstract class MarginalTagger extends Tagger with MarginalTaggerParams
> abstract class MarginalTaggingModel extends TaggingModel with 
> MarginalTaggerParams {
>   def getMargin(featuers: FeaturesType): Double
> }
> {code}
> HiddenMarkovModel API
> {code:title=HiddenMarkovModel.scala}
> trait HiddenMarkovModelParams extends MarginalTaggerParams with HasMaxIter 
> with HasTol with HasStandardization with HasThreshold {
>   def smoothing: DoubleParam
>   def emissionType: Param[String] //can be either Multinomial or Gaussian
> }
> class HiddenMarkovModel extends MarginalTagger with HiddenMarkovModelParams {
>   def initialModel: Option[HMMModel] //initial model before training
>   def def trainSupervised(dataset: Dataset[_]): Option[HMMModel]
>   def trainUnsupervised(dataset: Dataset[_]): HMMModel
> }
> abstract class HMMModel extends MarginalTaggingModel with 
> HiddenMarkovModelParams {
>   def initialProb: Vector // initial probability for states
>   def transitionProb: DenseMatrix // transition probability between states
>   //compute feature scores for all states in all frames in a sequence, 
> different for different emission models, e.g. Multinomial vs. GMM
>   def precomputeEmission(features: Matrix): List[Array[Double]] 
>   // accumulate sufficient statistics for emission model  
>   def getEmissionStats(features: Matrix, gammas: List[Array[Double]]): 
> DenseMatrix
>   // forward algorithm
>   def forward(emissions: Traversable[Array[Double]]): List[Array[Double]] 
>   // backword algorithm
>   def backward(emissions: Traversable[Array[Double]]):List[Array[Double]] 
> }
> class MultinomialHMMModel extends HMMModel {
>   def emissionProb: Matrix // emission probability from states to observations
> }
> object MultinomialHMMModel extends MLReadable[MultinomialHMMModel]
> class HMMModelReader extends MLReader[MultinomialHMMModel]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to