I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
search for "L2 norm", which turned up a number of other citations too.
Since we already use the L1 (Manhattan) and L2 (Euclidean) norms in the
respective DistanceMeasures, it might also be appropriate to allow the
cluster centroid normalization to be pluggable. I had never considered
that. Another mental lightbulb has just turned on. Thanks, Ted.
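For concreteness, the kind of pluggability I am imagining for the centroid
normalization might look roughly like this; the Normalizer interface and the
way it is wired into the centroid computation are just placeholders, not
existing Mahout classes:

interface Normalizer {
  // Rescale a vector, e.g. to unit L1 or L2 length.
  double[] normalize(double[] v);
}

class CentroidComputation {
  // Average the points, then apply whichever normalization was plugged in.
  static double[] centroid(double[][] points, Normalizer normalizer) {
    int dim = points[0].length;
    double[] mean = new double[dim];
    for (double[] p : points) {
      for (int i = 0; i < dim; i++) {
        mean[i] += p[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      mean[i] /= points.length;
    }
    return normalizer.normalize(mean);
  }
}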
Thinking more about DistanceMeasures, is it generally true that
'distance(X, Y)' is just 'norm(X - Y)' for some choice of norm(x)? If
so, would it be reasonable to factor the norm(x) out of the
DistanceMeasures and make that the pluggable clustering parameter?
Hmm, no: CosineDistanceMeasure normalizes X.dot(Y) by L2(X)*L2(Y), but
some refactoring might still be possible along these lines. Perhaps a
smaller number of DistanceMeasures would ensue.
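To make the refactoring idea concrete, a rough sketch of a norm-induced
distance (plain arrays and hypothetical class names, not Mahout's actual
API):

interface Norm {
  // Maps a vector to a non-negative scalar.
  double norm(double[] v);
}

class L1Norm implements Norm {
  public double norm(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.abs(x);
    }
    return sum;
  }
}

class L2Norm implements Norm {
  public double norm(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    return Math.sqrt(sumOfSquares);
  }
}

// A distance measure induced by a norm: distance(x, y) = norm(x - y).
class NormDistanceMeasure {
  private final Norm norm;

  NormDistanceMeasure(Norm norm) {
    this.norm = norm;
  }

  double distance(double[] x, double[] y) {
    double[] diff = new double[x.length];
    for (int i = 0; i < x.length; i++) {
      diff[i] = x[i] - y[i];
    }
    return norm.norm(diff);
  }
}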
Would users typically want to cluster using the same norm(x) for
distance and centroid, or should they be independently pluggable?
Jeff
Shashikant Kore wrote:
Ted,
L^1/L^2 Normalization sounds like a good solution. I will try it out
and report the results.
Is there any literature available comparing these normalization techniques?
Thank you.
--shashi
On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <[email protected]> wrote:
Shashi,
You are correct that this can be a problem, especially with vectors that
have a large number of elements that are zero, but not known to be such.
The definition as it stands is roughly an L^0 normalization. It is more
common in clustering to use an L^1 or L^2 normalization. This would divide
the terms by, respectively, the sum of the elements or the square root of
the sum of the squares of the elements. Both L^1 and L^2 normalization
avoid the problem you mention, since negligibly small elements will not
contribute significantly to the norm.
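In code, the two normalizations amount to something like the following
(plain arrays, just to show the arithmetic; this is not a reference to any
particular Mahout class):

class Normalize {

  // L^1: divide each element by the sum of the absolute values.
  static double[] l1Normalize(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.abs(x);
    }
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (sum == 0.0) ? 0.0 : v[i] / sum;
    }
    return out;
  }

  // L^2: divide each element by the square root of the sum of squares.
  static double[] l2Normalize(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    double length = Math.sqrt(sumOfSquares);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (length == 0.0) ? 0.0 : v[i] / length;
    }
    return out;
  }
}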
Traditionally, L^2 norms are used with documents. This dates back to Salton
and the term-vector model of text retrieval. That practice was, however,
based on somewhat inappropriate geometric intuitions. Other norms are quite
plausibly more appropriate. For instance, if normalized term frequencies
are considered to be estimates of word generation probabilities, then the
L^1 norm is much more appropriate.
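For example, raw term counts of (3, 1, 4) L^1-normalize to (0.375, 0.125,
0.5), which sum to 1 and can be read directly as a word distribution.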
On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <[email protected]> wrote:
...
My concern in the following code is that the total is divided by
numPoints. For a term, only a few of the numPoints vectors have
contributed towards the weight. The rest had the value set to zero. That
drags down the average, and the effect is much more pronounced in a large
set of sparse vectors.
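To put rough numbers on the concern, a toy illustration (not the actual
clustering code under discussion):

// Toy illustration only -- not the actual Mahout code.
// Suppose 3 of 1000 sparse vectors have weight 0.9 for some term and the
// remaining 997 have an implicit 0.0 for it.
public class SparseAverageDemo {
  public static void main(String[] args) {
    int numPoints = 1000;      // all vectors in the cluster
    int contributors = 3;      // vectors that actually contain the term
    double weight = 0.9;

    double total = contributors * weight;

    // Dividing by numPoints drags the term's centroid weight toward zero.
    System.out.println(total / numPoints);    // ~0.0027
    // Dividing only by the contributing vectors keeps it meaningful.
    System.out.println(total / contributors); // ~0.9
  }
}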