I think the scalability problems you are seeing are a consequence of using the default GaussianCluster models. These models perform especially poorly on large text clustering problems such as email. The pdf() calculation over wide topic vectors does a lot of expensive math for each term's pdf, and the combined pdf() product then underflows to boot. I've updated build-reuters.sh to use a CosineDistanceMeasure and a DistanceMeasureCluster instead, and performance has improved by over 100x on Reuters. So, evidently, has the quality of the clustering. See the recent thread "Dirichlet Process Clustering not working".
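To make the underflow point concrete, here is a minimal Python sketch. It is not Mahout code: the function names, the unit-variance independent Gaussian per term, and the 5000-term vector are all assumptions chosen for illustration. It shows that multiplying per-term pdf values over a wide vector collapses to zero in double precision, while a log-space sum stays finite and a cosine distance needs only a dot product and two norms.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x (hypothetical helper, not Mahout's)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def product_pdf(point, mean, std):
    """Multiply per-term densities, as a naive combined pdf would."""
    p = 1.0
    for x, m in zip(point, mean):
        p *= gaussian_pdf(x, m, std)
    return p

def log_pdf(point, mean, std):
    """Same quantity in log space: sum log-densities instead of multiplying."""
    log_norm = -math.log(std * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - m) / std) ** 2
               for x, m in zip(point, mean))

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Assumed setup: a sparse 5000-term vector compared against an identical mean.
DIMENSIONS = 5000
point = [1.0 if i % 50 == 0 else 0.0 for i in range(DIMENSIONS)]
mean = list(point)

naive = product_pdf(point, mean, 1.0)    # underflows even for a perfect match
stable = log_pdf(point, mean, 1.0)       # large negative but finite
distance = cosine_distance(point, mean)  # close to zero for identical vectors
```

Even with the point sitting exactly on the model mean, each term contributes a factor of about 0.4, so the product over thousands of terms falls below the smallest representable double and every model ends up looking equally (un)likely. Working in log space avoids the underflow but still pays for the per-term math, which is part of why a distance-based model is the cheaper option here.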
I've warped my brain trying to figure out how to use a combiner with Dirichlet and don't see how to do it. I'm open to ideas if anybody else has some.

-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]]
Sent: Wednesday, November 02, 2011 2:14 PM
To: [email protected]
Subject: Dirchlet

Tim Potter and I have tried running Dirichlet in the past on the ASF email set on EC2 and it didn't seem to scale all that well, so I was wondering if people had ideas on improving its speed. One question I had is whether we could inject a Combiner into the process? Ted also mentioned that there might be faster ways to check the models, but I will ask him to elaborate.

Thanks,
Grant
