[ 
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584544#action_12584544
 ] 

Ankur commented on MAHOUT-4:
----------------------------

Hi,
           So after scratching my head for past couple of days and
understanding EM from a general perspective I built a mental model
for EM in the context of clustering.

I thought its better to share my understanding for getting inputs before
turning out the code.

So here is a short write-up in my words, please feel free to 
fill any gaps/errors found

Expectation Maximization for clustering
-----------------------------------------------------
 Let 
     z = unobserved data, clusters in our case.
     y = observed data, points in our case.
 
 We start by initializing the model paramters p(y|z) and p(z) to appropriately
 normalized random values. 
 For eg:-  let number of points = 4 and number of clusters = 2 
 then for cluster z1

 p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
 p(z1) + p(z2) = 1

 E-Step.
 ------
      Calculate posteriori estimates of probabilities for various 
      values of the unknown z as follows:-
      
               p(y,z)                p(y|z)*p(z)
p(z|y) =  ---------  =  -----------------------------------------
                 p(y)         summation over z { p(y|z)*p(z) }
                
     Calculate expected value of log likelihood as follows:-
     
     Q(y) = summation over z { p(z|y) * log(p(y,z))}
     
     Q(y) remains unchanged if the calue calculated is  = previous Q(y) value.
     The algorithm terminates when no improvement is seen for Q(y) for all y
     
     Use Q(y) as the value for p(y,z) for M-Step to re-compute model parameters 
     p(y|z) and p(z)
     
     
 M-Step
 ------
                               p(y,z)  (from E-Step)
     p(y|z) =  --------------------------------------------------
                    summation over y { p(y,z) }   (from E-Step)
             
     p(z) = summation over y { p(y,z) }  (from E-Step)

Questions
=========
1. When and how do we re-compute the cluster centers ? 

2. As per my understanding points and clusters are simply labels with some 
   conditional probability assigned to them. A distance metric like one 
   used in K-means is nowhere involved, is that correct ? 


> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
>                 Key: MAHOUT-4
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-4
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ankur
>         Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that 
> demonstrates the algorithm functionality given a set of (user, click-url) 
> data.
> The prototype should be functionally complete and should serve as a basis for 
> the Map-Reduce version of the EM algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to