[ 
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586417#action_12586417
 ] 

Ted Dunning commented on MAHOUT-4:
----------------------------------

EM clustering is very seriously prone to over-fitting if you give reasonable 
flexibility to the clusters.

An important adjustment is to put a reasonable prior on the distributions being 
mixed; this acts as regularization that limits the over-fitting.  K-means 
avoids the problem (sort of) by assuming all clusters are symmetric with 
identical variance.

EM clustering can also be changed very slightly by assigning each point to a 
single cluster chosen at random according to its probability of membership.  
This turns EM clustering into Gibbs sampling.  The important change is that you 
can now sample over the distribution of possible clusterings, which matters a 
great deal when some parts of your data are well defined and other parts are 
not.
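The change from soft EM is tiny: instead of keeping the full responsibility matrix, draw one label per point from it.  A hypothetical Python sketch of that assignment step (function and parameter names are mine):

```python
import numpy as np

def gibbs_assign(x, mu, var, pi, rng):
    """One Gibbs-style sweep: draw a single cluster label per point
    with probability equal to its posterior membership, instead of
    keeping soft responsibilities as EM does."""
    d = (x[:, None] - mu[None, :]) ** 2
    log_p = np.log(pi) - 0.5 * np.log(2 * np.pi * var) - d / (2 * var)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)
    # Sample one label per point from its categorical posterior.
    return np.array([rng.choice(len(pi), p=row) for row in p])
```

Alternating this step with the usual M-step (computed from the hard labels) gives the sampler; collecting labels across sweeps gives samples from the distribution over clusterings.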

A further extension is to assume an infinite mixture model.  The 
implementation is only slightly more difficult, and the result is a (nearly) 
non-parametric clustering algorithm.  I will attach an R implementation for 
reference.
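The attached implementation is in R; for reference, here is a toy Python sketch of the same idea: a collapsed Gibbs sampler for a Dirichlet-process mixture of 1-D Gaussians with known component variance and a normal prior on each mean.  Components are created and destroyed as labels are resampled, so the number of clusters is inferred from the data.  All names and hyperparameters are mine, chosen for illustration.

```python
import numpy as np

def dp_gmm_gibbs(x, alpha=1.0, sigma=1.0, tau=10.0, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for an infinite (DP) mixture of 1-D
    Gaussians with known variance sigma^2 and a N(0, tau^2) prior on
    each component mean.  Returns one sampled clustering."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.zeros(n, dtype=int)          # all points start at one table
    for _ in range(n_sweeps):
        for i in range(n):
            z[i] = -1                   # remove point i
            labels, counts = np.unique(z[z >= 0], return_counts=True)
            log_w = []
            # Posterior-predictive density of x[i] under each
            # existing component (normal mean with known variance).
            for lab, c in zip(labels, counts):
                s = x[z == lab].sum()
                post_var = 1.0 / (c / sigma**2 + 1.0 / tau**2)
                post_mean = post_var * s / sigma**2
                pred_var = post_var + sigma**2
                log_w.append(np.log(c) - 0.5 * np.log(2 * np.pi * pred_var)
                             - (x[i] - post_mean) ** 2 / (2 * pred_var))
            # Probability of opening a new component (prior predictive).
            pred_var0 = tau**2 + sigma**2
            log_w.append(np.log(alpha) - 0.5 * np.log(2 * np.pi * pred_var0)
                         - x[i] ** 2 / (2 * pred_var0))
            w = np.exp(np.array(log_w) - max(log_w))
            w /= w.sum()
            choice = rng.choice(len(w), p=w)
            if choice == len(labels):   # open a new component
                z[i] = labels.max() + 1 if len(labels) else 0
            else:
                z[i] = labels[choice]
    return z
```

On well-separated data this settles on a small number of occupied components without k ever being specified, which is what makes the method (nearly) non-parametric.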


> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
>                 Key: MAHOUT-4
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-4
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ankur
>         Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that 
> demonstrates the algorithm functionality given a set of (user, click-url) 
> data.
> The prototype should be functionally complete and should serve as a basis for 
> the Map-Reduce version of the EM algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
