I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
search for "L2 norm", which turned up a number of other citations too.
Since we already use the L1 (Manhattan) and L2 (Euclidean) norms in the
respective DistanceMeasures, it might also be appropriate to allow the
cluster centroid normalization to be pluggable. I had never considered
that. Another mental lightbulb has just turned on. Thanks, Ted.
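For concreteness, the kind of pluggability I am imagining for the centroid
normalization might look roughly like this; the Normalizer interface and the
way it is wired into the centroid computation are just placeholders, not
existing Mahout classes:

interface Normalizer {
  // Rescale a vector, e.g. to unit L1 or L2 length.
  double[] normalize(double[] v);
}

class CentroidComputation {
  // Average the points, then apply whichever normalization was plugged in.
  static double[] centroid(double[][] points, Normalizer normalizer) {
    int dim = points[0].length;
    double[] mean = new double[dim];
    for (double[] p : points) {
      for (int i = 0; i < dim; i++) {
        mean[i] += p[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      mean[i] /= points.length;
    }
    return normalizer.normalize(mean);
  }
}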
Thinking more about DistanceMeasures, is it generally true that
'distance(X, Y)' is just 'norm(X - Y)' for some choice of norm(x)? If
so, would it be reasonable to factor the norm(x) out of the
DistanceMeasures and make that the pluggable clustering parameter?
Hmm, no: CosineDistanceMeasure normalizes X.dot(Y) by L2(X)*L2(Y), but
some refactoring might still be possible along these lines. Perhaps a
smaller number of DistanceMeasures would ensue.
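To make the refactoring idea concrete, a rough sketch of a norm-induced
distance (plain arrays and hypothetical class names, not Mahout's actual
API):

interface Norm {
  // Maps a vector to a non-negative scalar.
  double norm(double[] v);
}

class L1Norm implements Norm {
  public double norm(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.abs(x);
    }
    return sum;
  }
}

class L2Norm implements Norm {
  public double norm(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    return Math.sqrt(sumOfSquares);
  }
}

// A distance measure induced by a norm: distance(x, y) = norm(x - y).
class NormDistanceMeasure {
  private final Norm norm;

  NormDistanceMeasure(Norm norm) {
    this.norm = norm;
  }

  double distance(double[] x, double[] y) {
    double[] diff = new double[x.length];
    for (int i = 0; i < x.length; i++) {
      diff[i] = x[i] - y[i];
    }
    return norm.norm(diff);
  }
}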
Would users typically want to cluster using the same norm(x) for
distance and centroid, or should they be independently pluggable?
Jeff
Shashikant Kore wrote:
Ted,
L^1/L^2 Normalization sounds like a good solution. I will try it out
and report the results.
Is there any literature available comparing these normalization techniques?
Thank you.
--shashi
On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <[email protected]> wrote:
Shashi,
You are correct that this can be a problem, especially with vectors that
have a large number of elements that are zero, but not known to be such.
The definition as it stands is roughly an L^0 normalization. It is more
common in clustering to use an L^1 or L^2 normalization. This would divide
the terms by, respectively, the sum of the elements or the square root of
the sum of the squares of the elements. Both L^1 and L^2 normalization
avoid the problem you mention, since negligibly small elements will not
contribute significantly to the norm.
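In code, the two normalizations amount to something like the following
(plain arrays, just to show the arithmetic; this is not a reference to any
particular Mahout class):

class Normalize {

  // L^1: divide each element by the sum of the absolute values.
  static double[] l1Normalize(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.abs(x);
    }
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (sum == 0.0) ? 0.0 : v[i] / sum;
    }
    return out;
  }

  // L^2: divide each element by the square root of the sum of squares.
  static double[] l2Normalize(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    double length = Math.sqrt(sumOfSquares);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (length == 0.0) ? 0.0 : v[i] / length;
    }
    return out;
  }
}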
Traditionally, L^2 norms are used with documents. This dates back to Salton
and the term-vector model of text retrieval. That practice was, however,
based on somewhat inappropriate geometric intuitions. Other norms are quite
plausibly more appropriate. For instance, if normalized term frequencies
are considered to be estimates of word generation probabilities, then the
L^1 norm is much more appropriate.
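For example, raw term counts of (3, 1, 4) L^1-normalize to (0.375, 0.125,
0.5), which sum to 1 and can be read directly as a word distribution.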
On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <[email protected]> wrote:
...
My concern in the following code is that the total is divided by
numPoints. For a term, only a few of the numPoints vectors have
contributed towards the weight. The rest had the value set to zero. That
drags down the average, and the effect is much more pronounced in a large
set of sparse vectors.
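To put rough numbers on the concern, a toy illustration (not the actual
clustering code under discussion):

// Toy illustration only -- not the actual Mahout code.
// Suppose 3 of 1000 sparse vectors have weight 0.9 for some term and the
// remaining 997 have an implicit 0.0 for it.
public class SparseAverageDemo {
  public static void main(String[] args) {
    int numPoints = 1000;      // all vectors in the cluster
    int contributors = 3;      // vectors that actually contain the term
    double weight = 0.9;

    double total = contributors * weight;

    // Dividing by numPoints drags the term's centroid weight toward zero.
    System.out.println(total / numPoints);    // ~0.0027
    // Dividing only by the contributing vectors keeps it meaningful.
    System.out.println(total / contributors); // ~0.9
  }
}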