Hi Ted,

HAC algorithms require a large number of dot product computations, so we can use an inverted index (a Lucene index) to improve performance. We can also come up with a practical formula for the similarity computations, as is done in Lucene scoring.
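To illustrate the inverted-index idea, here is a minimal Python sketch. It is not Lucene or Mahout code: the function names and the toy documents are made up for illustration, and the "index" is just an in-memory dictionary of postings. The point is that walking the postings lists only touches document pairs that share at least one term, so pairs with a zero dot product are never computed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in enumerate(docs):
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def pairwise_dot_products(docs):
    """Return {(i, j): dot product} for i < j, skipping pairs with no shared term."""
    index = build_inverted_index(docs)
    scores = defaultdict(float)
    for postings in index.values():
        # Postings are appended in doc_id order, so doc_i < doc_j below.
        for a in range(len(postings)):
            doc_i, w_i = postings[a]
            for b in range(a + 1, len(postings)):
                doc_j, w_j = postings[b]
                scores[(doc_i, doc_j)] += w_i * w_j
    return dict(scores)


# Hypothetical sparse term-weight vectors, one dict per document.
docs = [
    {"apache": 1.0, "mahout": 2.0},
    {"mahout": 1.0, "lucene": 3.0},
    {"hadoop": 1.0},
]
print(pairwise_dot_products(docs))  # {(0, 1): 2.0}
```

Document 2 shares no term with the others, so it never shows up in the result. One caveat with this approach is that very common terms produce long postings lists, and the work per list grows quadratically with its length.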
Most of the time, the documents being clustered are high-dimensional sparse vectors, so the number of computations required is small. One exception is group-average agglomerative clustering, where the centroids become dense. This issue can be addressed by using medoids (the document vector closest to the centroid) instead of dense centroids. By the way, did Mahout address this dense centroid issue in its K-means implementation?

With very large datasets, however, HAC is infeasible. In such scenarios we can run an HAC algorithm with a low threshold to compute high-quality seeds for K-means (currently Canopy is used for this). Even though there are limitations, I believe it is suitable and worthwhile to include HAC in Mahout.

Thanks,
Lahiru
