I think the scalability problems you are seeing are a consequence of using the 
default GaussianCluster models. These models perform especially poorly for 
large text clustering problems such as email. The pdf() calculation over wide 
topic vectors does a lot of complicated math for each term pdf and then 
underflows on the combined pdf() product to boot. I've updated build-reuters.sh 
to use a CosineDistanceMeasure and a DistanceMeasureCluster instead and the 
performance has improved over 100x on Reuters. So has, evidently, the quality 
of the clustering. See recent posts "Dirichlet Process Clustering not working".
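
To illustrate the underflow point above, here is a minimal Python sketch. It is 
not Mahout's code: the function names, the unit-variance Gaussian and the 
5000-term toy vectors are all assumptions made for illustration, and cosine 
distance is assumed to mean 1 minus cosine similarity. It shows the product of 
per-term pdfs collapsing to zero on a wide vector, while a log-sum or a cosine 
distance stays usable.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def product_pdf(vec, center, std=1.0):
    """Multiply the per-term pdfs together (hypothetical stand-in for a
    Gaussian cluster model's pdf over a wide vector)."""
    p = 1.0
    for x, m in zip(vec, center):
        p *= gaussian_pdf(x, m, std)
    return p

def log_pdf(vec, center, std=1.0):
    """Same quantity in log space: a sum of logs cannot underflow this way."""
    log_norm = -math.log(std * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - m) / std) ** 2
               for x, m in zip(vec, center))

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Toy data (assumed): a sparse 5000-term document and a nearby cluster center.
dims = 5000
doc = [1.0 if i % 50 == 0 else 0.0 for i in range(dims)]
center = [0.9 if i % 50 == 0 else 0.0 for i in range(dims)]

underflowed = product_pdf(doc, center)   # product of 5000 values below 0.4
log_score = log_pdf(doc, center)         # large negative, but finite
distance = cosine_distance(doc, center)  # vectors are parallel

print(underflowed, log_score, distance)
```

Each per-term pdf here is below 0.4, so the product drops under the smallest 
representable double long before all 5000 terms are multiplied in, and every 
cluster ends up with the same score of zero. The distance-based score needs one 
pass of multiply-adds and no exp() calls.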

I've warped my brain trying to figure out how to use a combiner with Dirichlet 
and don't see how to do it. I'm open to ideas if anybody else has some.

-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]] 
Sent: Wednesday, November 02, 2011 2:14 PM
To: [email protected]
Subject: Dirichlet

Tim Potter and I have tried running Dirichlet in the past on the ASF email set 
on EC2 and it didn't seem to scale all that well, so I was wondering if people 
had ideas on improving its speed.  One question I had is whether we could 
inject a Combiner into the process.  Ted also mentioned that there might be 
faster ways to check the models, but I will ask him to elaborate.

Thanks,
Grant
