[
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586417#action_12586417
]
Ted Dunning commented on MAHOUT-4:
----------------------------------
EM clustering is seriously prone to over-fitting if you give the clusters
reasonable flexibility.
An important adjustment is to put a reasonable prior on the distributions being
mixed. This acts as a regularizer that helps avoid the problem. K-means
(sort of) avoids the problem by assuming all clusters are symmetric with
identical variance.
EM clustering can also be changed very slightly by assigning each point to a
single cluster chosen at random according to its probability of membership.
This turns EM clustering into Gibbs sampling. The important change is that you
can now sample over the distribution of possible clusterings, which can be very
important if some parts of your data are well defined and some parts not so
well defined.
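The change from EM to Gibbs sampling is confined to the assignment step; a hypothetical sketch of one sweep (same mixture notation as above, names are mine, not Mahout's):

```python
import numpy as np

def gibbs_assign(x, mu, var, pi, rng):
    """One Gibbs-style assignment sweep for a 1-D Gaussian mixture:
    where EM would keep the soft responsibilities, here each point is
    assigned to a single cluster sampled in proportion to its posterior
    membership probability.  Illustrative sketch only."""
    logp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
    logp -= logp.max(axis=1, keepdims=True)
    p = np.exp(logp)
    p /= p.sum(axis=1, keepdims=True)
    # sample one cluster per point; EM would instead average over p
    z = np.array([rng.choice(len(mu), p=row) for row in p])
    return z
```

Repeating this sweep (and re-sampling the parameters from their posteriors given z) yields draws from the posterior over clusterings rather than a single point estimate.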
A further extension can be made by assuming an infinite mixture model. The
implementation is only slightly more difficult, and the result is a (nearly)
non-parametric clustering algorithm. I will attach an R implementation for
reference.
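The prior underlying the infinite mixture is the Chinese restaurant process, which is what makes the method (nearly) non-parametric: the number of clusters is not fixed in advance but grows slowly with the data. A tiny sketch of drawing assignments from that prior (my own illustrative code, not the attached R implementation):

```python
import numpy as np

def crp(n, alpha, rng):
    """Draw cluster assignments for n points from a Chinese restaurant
    process.  Each point joins an existing cluster with probability
    proportional to that cluster's size, or starts a new cluster with
    probability proportional to alpha.  Illustrative sketch only."""
    counts = []  # size of each cluster so far
    z = []       # cluster index assigned to each point
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        c = rng.choice(len(probs), p=probs)
        if c == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[c] += 1
        z.append(c)
    return np.array(z), counts
```

In a Dirichlet process mixture, the Gibbs sweep above is combined with this prior, so the sampler can create and delete clusters as it goes; the expected number of clusters grows only logarithmically with n.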
> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
> Key: MAHOUT-4
> URL: https://issues.apache.org/jira/browse/MAHOUT-4
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ankur
> Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that
> demonstrates the algorithm functionality given a set of (user, click-url)
> data.
> The prototype should be functionally complete and should serve as a basis for
> the Map-Reduce version of the EM algorithm.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.