[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586428#action_12586428 ]
Isabel Drost commented on MAHOUT-4:
-----------------------------------
> An important adjustment is to put a reasonable prior on the distributions
> being mixed. This serves as regularization that helps avoid the problem.
> K-means (sort of) avoids the problem by assuming all clusters are symmetric
> with identical variance.
I think you could impose the same restriction on EM as well?
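To make the restriction concrete, here is a minimal sketch of EM for a 1-D Gaussian mixture in which all clusters share one variance and have equal mixing weights. The function name, the starting variance of 1.0, and the variance floor are assumptions made for illustration; this is not taken from the attached patch.

```python
import math


def em_shared_variance(points, means, iterations=50):
    """EM for a 1-D Gaussian mixture with equal weights and one shared variance."""
    means = list(means)
    k = len(means)
    variance = 1.0  # assumed starting value
    for _ in range(iterations):
        # E-step: membership probabilities. Because the variance is shared,
        # the normalising constant cancels and only squared distance matters.
        resp = []
        for x in points:
            logs = [-(x - m) ** 2 / (2.0 * variance) for m in means]
            top = max(logs)
            weights = [math.exp(value - top) for value in logs]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: weighted means, then a single pooled variance.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass > 0.0:
                means[j] = sum(r[j] * x for r, x in zip(resp, points)) / mass
        pooled = sum(
            r[j] * (x - means[j]) ** 2
            for r, x in zip(resp, points)
            for j in range(k)
        )
        # The floor (an assumption) keeps the variance from collapsing to zero.
        variance = max(pooled / len(points), 1e-6)
    return means, variance
```

As the shared variance shrinks, the membership probabilities approach hard 0/1 assignments and the update behaves like k-means.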
> EM clustering can also be changed very slightly by assigning points to
> single clusters chosen at random according to the probability of
> membership. This turns EM clustering into Gibbs sampling.
That is the simplest explanation of Gibbs sampling I have read so far :)
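A rough sketch of that change, assuming 1-D points, a fixed shared variance, and hypothetical helper names: instead of keeping the membership probabilities as soft weights, each point is assigned to one cluster drawn at random from them. Strictly speaking, a full Gibbs sampler would also draw the cluster parameters from their posterior; this sketch only recomputes the means from the sampled assignments.

```python
import math
import random


def membership_probs(x, means, variance):
    """Probability of x belonging to each cluster (equal weights, shared variance)."""
    logs = [-(x - m) ** 2 / (2.0 * variance) for m in means]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_assignment(probs, rng):
    """Draw a single cluster index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sampling_sweep(points, means, variance, rng):
    """One sweep: sample a hard assignment per point, then recompute the means."""
    assignments = [
        sample_assignment(membership_probs(x, means, variance), rng)
        for x in points
    ]
    new_means = list(means)
    for j in range(len(means)):
        members = [x for x, z in zip(points, assignments) if z == j]
        if members:  # an empty cluster keeps its previous mean
            new_means[j] = sum(members) / len(members)
    return assignments, new_means
```

Repeating `sampling_sweep` yields a sequence of assignments rather than a single point estimate, which is the practical difference from plain EM.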
> Further extension can also be made by assuming an infinite mixture model.
> The implementation is only slightly more difficult and the result is a
> (nearly) non-parametric clustering algorithm. I will attach an R
> implementation for reference.
I think Dirichlet process based clustering comes with the handy property
that you can avoid passing the number of clusters into the algorithm, right?
To me that seems better for realistic settings where you usually have some data
available but cannot tell how many clusters there are.
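One way to see why no cluster count is needed is the Chinese restaurant process view of the Dirichlet process prior. The sketch below samples cluster labels from that prior only; the function name and the concentration parameter `alpha` are illustrative, and it is not a port of the attached dp-cluster.r. A real sampler would multiply these prior weights by the likelihood of each point under each cluster.

```python
import random


def crp_assignments(n_points, alpha, rng):
    """Sample cluster labels from a Chinese restaurant process prior."""
    counts = []  # counts[j] = number of points already in cluster j
    labels = []
    for _ in range(n_points):
        # An existing cluster j is chosen with weight counts[j];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(counts):
            counts.append(1)
        else:
            counts[choice] += 1
        labels.append(choice)
    return labels, counts
```

The number of clusters is an outcome of the sampling rather than an input: a larger `alpha` tends to open more clusters, and more data can always open another one.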
Do you think one can solve the original PLSI problem with Gibbs sampling or
an infinite mixture model as well? After all, the original patch was about
integrating PLSI, which is based on EM. I wonder whether one should split this
thread into at least four threads:
- EM implementation
- Gibbs sampling implementation
- Dirichlet process implementation
- PLSI based on EM
What do you think?
> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
> Key: MAHOUT-4
> URL: https://issues.apache.org/jira/browse/MAHOUT-4
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ankur
> Attachments: dp-cluster.r, Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that
> demonstrates the algorithm functionality given a set of (user, click-url)
> data.
> The prototype should be functionally complete and should serve as a basis for
> the Map-Reduce version of the EM algorithm.
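For reference, a minimal single-machine sketch of PLSI fitted with EM on (user, click-url) counts, using the model P(u, d) = sum over z of P(z) P(u|z) P(d|z). The function name, the dictionary-based input format, and the random initialisation are assumptions for illustration; this is not the code in Mahout_EM.patch.

```python
import random


def plsi_em(counts, n_topics, iterations=50, seed=0):
    """Fit PLSI by EM. counts maps (user, url) -> click count (must be non-empty)."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in counts})
    urls = sorted({d for _, d in counts})

    def random_dist(keys):
        raw = {key: rng.random() + 0.1 for key in keys}
        total = sum(raw.values())
        return {key: value / total for key, value in raw.items()}

    p_z = [1.0 / n_topics] * n_topics
    p_u = [random_dist(users) for _ in range(n_topics)]
    p_d = [random_dist(urls) for _ in range(n_topics)]

    for _ in range(iterations):
        new_z = [0.0] * n_topics
        new_u = [dict.fromkeys(users, 0.0) for _ in range(n_topics)]
        new_d = [dict.fromkeys(urls, 0.0) for _ in range(n_topics)]
        for (u, d), n in counts.items():
            # E-step: posterior P(z | u, d) for this observed pair.
            joint = [p_z[z] * p_u[z][u] * p_d[z][d] for z in range(n_topics)]
            total = sum(joint)
            if total == 0.0:
                continue
            for z in range(n_topics):
                weight = n * joint[z] / total
                new_z[z] += weight
                new_u[z][u] += weight
                new_d[z][d] += weight
        # M-step: renormalise the expected counts into distributions.
        grand = sum(new_z)
        for z in range(n_topics):
            if new_z[z] > 0.0:
                p_u[z] = {u: v / new_z[z] for u, v in new_u[z].items()}
                p_d[z] = {d: v / new_z[z] for d, v in new_d[z].items()}
            p_z[z] = new_z[z] / grand
    return p_z, p_u, p_d
```

The E-step is independent per (user, url) pair and the M-step is a sum of expected counts keyed by topic, which is the structure a Map-Reduce version could exploit.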