Hi,

To calculate the centroid (say, in Canopy clustering) of a set of sparse vectors, all the non-zero weights for each term are added up and then divided by the cardinality of the vector. This gives the average weight of a term across all the vectors.
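
To make the two candidate calculations concrete, here is a minimal Python sketch. It is not the code of any clustering library: the function names `centroid_all` and `centroid_nonzero` are hypothetical, sparse vectors are assumed to be plain dicts mapping term to weight, and the first function assumes the divisor is the number of vectors (one reading of "the average of weights of a term in all the vectors"). The second function is the non-zero-only average asked about below.

```python
from collections import defaultdict


def centroid_all(vectors):
    """Per-term sum divided by the total number of vectors (zeros count)."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    n = len(vectors)  # assumption: divisor is the number of vectors
    return {term: total / n for term, total in totals.items()}


def centroid_nonzero(vectors):
    """Per-term sum divided by the number of vectors where the term is non-zero."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for vec in vectors:
        for term, weight in vec.items():
            if weight != 0:
                totals[term] += weight
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}


# Toy data (made up for illustration): four documents, three terms.
docs = [
    {"apple": 2.0, "pear": 4.0},
    {"apple": 4.0},
    {"plum": 3.0},
    {"apple": 6.0},
]

print(centroid_all(docs))      # rare terms are pulled toward zero
print(centroid_nonzero(docs))  # rare terms keep their typical weight
```

On this toy data, "pear" occurs in one of four documents, so the first function gives it a quarter of its weight while the second leaves it unchanged. With tens of thousands of dimensions and only a few hundred non-zero entries per vector, that gap becomes much larger.
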
I have sparse vectors with a cardinality of 50,000+, but each vector has only a couple of hundred terms. While calculating the centroid, for each term only a few hundred documents with non-zero term weights contribute to the total weight, but since the total is divided by the cardinality (50,000), the final weight is minuscule. This results in small documents being marked as closer to the centroid, since they have fewer terms in them. The clusters don't look "right."

I am wondering whether the term weights of the centroid should be calculated by considering only the non-zero elements. That is, if a term occurs in 10 vectors, then the weight of that term in the centroid is the average of those 10 weight values. I couldn't locate any literature that specifically addresses sparse vectors in centroid calculation. Any pointers are appreciated.

Thanks,
--shashi
--
http://www.bandhan.com/
