Jeff,

Thank you for pointing out the error. Not sure what I was thinking when I wrote cardinality as the denominator.
My concern with the following code is that the total is divided by numPoints. For a given term, only a few of the numPoints vectors have contributed towards its weight; the rest had the value set to zero. That drags down the average, and the effect is much more pronounced in a large set of sparse vectors.

For example, consider the following doc vectors:

v1: [0:3, 1:6, 2:0, 3:3]
v2: [0:3, 1:0, 2:0, 3:6]
v3: [0:0, 1:0, 2:3, 3:0]

The centroid will be:

Centroid: [0:2, 1:2, 2:1, 3:3]

The problem I face with the existing centroid calculation is that out of 100k documents, only a few thousand (or even fewer) contribute to the weight of a given term. When that weight is divided by 100k, it comes very close to zero. I am looking for ways to avoid that.

If we consider only the non-zero values, the centroid will be:

Centroid: [0:3, 1:6, 2:3, 3:4.5]

Is this centroid "better" if we are considering a large number of sparse vectors? (A small standalone sketch of both calculations follows the quoted message below.)

--shashi

On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman <[email protected]> wrote:
> Hi Shashi,
>
> I'm not sure I understand your issue. The Canopy centroid calculation
> divides the individual term totals by the number of points that have been
> added to the cluster, not by the cardinality of the vector:
>
>  public Vector computeCentroid() {
>    Vector result = new SparseVector(pointTotal.cardinality());
>    for (int i = 0; i < pointTotal.cardinality(); i++)
>      result.set(i, pointTotal.get(i) / numPoints);
>    return result;
>  }
>
> Am I misinterpreting something?
> Jeff
>
> Shashikant Kore wrote:
>>
>> Hi,
>>
>> To calculate the centroid (say in Canopy clustering) of a set of
>> sparse vectors, all the non-zero weights are added for each term and
>> then divided by the cardinality of the vector, which gives the average
>> of the weights of a term over all the vectors.
>>
>> I have sparse vectors of cardinality 50,000+, but each vector has
>> only a couple hundred terms. While calculating the centroid, for
>> each term, only a few hundred documents with non-zero term weights
>> contribute to the total weight, but since it is divided by the
>> cardinality (50,000), the final weight is minuscule. This results in
>> small documents being marked closer to the centroid, as they have fewer
>> terms in them. The clusters don't look "right."
>>
>> I am wondering if the term weights of the centroid should be calculated
>> by considering only the non-zero elements. That is, if a term occurs
>> in 10 vectors, then the weight of the term in the centroid is the average
>> of those 10 weight values. I couldn't locate any literature which
>> specifically talks about the case of sparse vectors in centroid
>> calculation. Any pointers are appreciated.
>>
>> Thanks,
>> --shashi
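
To make the comparison concrete, here is a small standalone sketch of both calculations on the v1/v2/v3 example above. It uses plain double[] arrays rather than Mahout's SparseVector, so the class and method names are purely illustrative and not part of any existing code:

public class CentroidExample {

  // Average each term over all points -- what computeCentroid() does today.
  static double[] centroidOverAllPoints(double[][] points, int cardinality) {
    double[] centroid = new double[cardinality];
    for (double[] p : points)
      for (int i = 0; i < cardinality; i++)
        centroid[i] += p[i] / points.length;
    return centroid;
  }

  // Average each term only over the points in which that term is non-zero.
  static double[] centroidOverNonZeroPoints(double[][] points, int cardinality) {
    double[] totals = new double[cardinality];
    int[] contributors = new int[cardinality];
    for (double[] p : points)
      for (int i = 0; i < cardinality; i++)
        if (p[i] != 0.0) {
          totals[i] += p[i];
          contributors[i]++;
        }
    double[] centroid = new double[cardinality];
    for (int i = 0; i < cardinality; i++)
      if (contributors[i] > 0)
        centroid[i] = totals[i] / contributors[i];
    return centroid;
  }

  public static void main(String[] args) {
    double[][] points = { {3, 6, 0, 3},    // v1
                          {3, 0, 0, 6},    // v2
                          {0, 0, 3, 0} };  // v3
    System.out.println(java.util.Arrays.toString(centroidOverAllPoints(points, 4)));
    // -> [2.0, 2.0, 1.0, 3.0]
    System.out.println(java.util.Arrays.toString(centroidOverNonZeroPoints(points, 4)));
    // -> [3.0, 6.0, 3.0, 4.5]
  }
}

In the non-zero version the denominator varies per term (the number of points that actually contain it), and terms no point contains simply stay at zero, which is what produces [0:3, 1:6, 2:3, 3:4.5] above.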
