Ted Dunning wrote:
This could also be caused if the prior is very diffuse.  This makes the
probability that a point will go to any new cluster quite low.  You can
compensate somewhat for this with different values of alpha.
Could you elaborate on the function of alpha in the algorithm? Looking at the current implementation, it is only used to initialize the totalCount values (to alpha/k) when sampling from the prior; as far as I can tell it is not used anywhere else. Its current role is pretty minimal, and I wonder whether something fell through the cracks during all of the refactoring from the R prototype.
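For context, in a Chinese Restaurant Process view of the Dirichlet Process, alpha is the concentration parameter that directly sets the probability of opening a new cluster: an existing cluster k is chosen with probability count[k]/(n + alpha) and a brand-new cluster with probability alpha/(n + alpha). A rough sketch in plain Python (not the Mahout code; `crp_assign` and its arguments are made up for illustration):

```python
import random

def crp_assign(counts, alpha, rng=random):
    """Sample a cluster index for one point under a CRP prior.

    Existing cluster k is chosen with probability counts[k] / (n + alpha);
    a new cluster (index len(counts)) with probability alpha / (n + alpha).
    """
    n = sum(counts)
    r = rng.random() * (n + alpha)
    acc = 0.0
    for k, c in enumerate(counts):
        acc += c
        if r < acc:
            return k
    return len(counts)  # open a new cluster
```

With a diffuse prior the model likelihoods for a new cluster are tiny, so a larger alpha is the only lever that keeps new clusters reachable; with alpha near zero, new clusters are essentially never opened.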
I have had some half thoughts about how to improve the mixing and currently
think that starting conditions may be the trick.  Using something like
k-means++ to initialize the clusters might help enormously.
If it helps k-means, it would likely help Dirichlet too. Currently all of the prior sampling is done by model distributions with no knowledge of the dataset, purely via random draws. I looked for a patch for MAHOUT-153 but did not see one yet.
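For reference, k-means++ seeding picks the first center uniformly at random and each subsequent center with probability proportional to its squared distance from the nearest center chosen so far, which is what would make the starting clusters data-aware. A sketch in plain Python (not a MAHOUT-153 patch; `kmeans_pp_init` and its shape are illustrative):

```python
import random

def kmeans_pp_init(points, k, rng=random):
    """k-means++ seeding: first center uniform, later centers drawn with
    probability proportional to squared distance from the nearest center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        if total == 0:  # all remaining points coincide with chosen centers
            centers.append(rng.choice(points))
            continue
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if r < acc:
                centers.append(x)
                break
    return centers
```

Seeding the Dirichlet models from centers like these, instead of from dataset-blind draws from the prior, is presumably the kind of starting condition Ted has in mind.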
