Hi Ted,

HAC algorithms require a large number of dot product computations, so we
can use an inverted index (a Lucene index) to improve performance. We can
also derive a practical formula for the similarity computations, as is done
in Lucene scoring.
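
To make the inverted index idea concrete, here is a minimal sketch in plain Python. It does not use the Lucene API; the function names (build_inverted_index, dot_products) and the toy documents are made up for illustration. The point is that only documents sharing at least one term with the query vector are ever touched.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> sparse vector (dict of term -> weight)."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            # Postings list: every document that contains this term.
            index[term].append((doc_id, weight))
    return index

def dot_products(query_vec, index):
    """Dot product of query_vec with every document sharing a term with it."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return dict(scores)

# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": {"apache": 1.0, "mahout": 2.0},
    "d2": {"lucene": 1.0, "index": 1.0},
    "d3": {"mahout": 1.0, "lucene": 3.0},
}
index = build_inverted_index(docs)
scores = dot_products(docs["d1"], index)
```

Here d2 shares no term with d1, so it never appears in the result and no work is spent on it.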

Most of the time, the documents being clustered are high-dimensional sparse
vectors, so the number of computations required is small. One exception is
group-average agglomerative clustering, where the centroids become dense.
This can be addressed by using medoids (the document vector closest to the
centroid) instead of dense centroids.
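
A rough sketch of the medoid idea, assuming cosine similarity as the closeness measure (the choice of measure is my assumption here) and sparse vectors stored as dicts. The helper names are made up for illustration.

```python
import math

def dot(a, b):
    # Iterate over the shorter vector to keep the cost low.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def centroid(vectors):
    """Mean of the vectors; its non-zero terms are the union of all terms."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in total.items()}

def medoid(vectors):
    """The member vector most similar to the centroid."""
    c = centroid(vectors)
    return max(vectors, key=lambda v: cosine(v, c))

# Hypothetical cluster, for illustration only.
cluster = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "c": 1.0},
    {"a": 1.0, "d": 1.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
]
rep = medoid(cluster)
```

The centroid carries every term seen in the cluster, while the medoid is as sparse as a single document, so later similarity computations against it stay cheap.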

By the way, has Mahout addressed this dense centroid issue in its K-means
implementation?

However, with very large datasets, HAC is infeasible. In such scenarios we
can run an HAC algorithm with a low threshold to compute high-quality seeds
for the K-means algorithm (currently Canopy is used for this).
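
The seeding idea could look roughly like the sketch below. The assumptions are mine: average-link merging on a small sample, cosine similarity, and a stopping rule that ends merging once the best pair falls below a similarity threshold. The function hac_seeds is hypothetical, and this is not how Canopy or Mahout's K-means works.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average_link(c1, c2, points):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(points[i], points[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)

def hac_seeds(points, threshold):
    """Merge clusters until no pair is at least `threshold` similar,
    then return the cluster means as candidate K-means seeds."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = average_link(clusters[i], clusters[j], points)
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    dim = len(points[0])
    return [[sum(points[m][d] for m in c) / len(c) for d in range(dim)]
            for c in clusters]

# Hypothetical sample drawn from a larger dataset.
sample = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seeds = hac_seeds(sample, threshold=0.8)
```

This naive loop is cubic in the sample size, which is why it would only be run on a sample; the number of seeds, and hence k, falls out of the threshold instead of being fixed in advance.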

Even though there are limitations, I believe it is suitable and worthwhile
to include HAC in Mahout.


Thanks,

Lahiru
