An interesting question came up recently about using the Euclidean
distance d between two vectors as a measure of their similarity.

You can use 1 / (1 + d), which mostly works, except that it
'penalizes' larger vectors, which have more dimensions along which to
differ. This is bad when those vectors are the subsets of user
preference data on which two users overlap: more overlap ought to
mean higher similarity, not lower.

I have an ancient, bad kludge in there that uses n / (1 + d), where n
is the number of dimensions in which the two vectors overlap. It's
trying to normalize away the average distance between randomly-chosen
vectors in the space (remember that each dimension is bounded, between
the min and max rating). But that average isn't n.
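
To see what that average actually looks like, here's a quick Monte
Carlo sketch (again illustrative Python; the 1-5 rating scale and the
trial count are assumptions):

```python
# A quick Monte Carlo sketch of the quantity the kludge is trying to
# normalize by: the average distance between two random vectors whose
# coordinates are uniform between the min and max rating.
import math, random

def avg_random_distance(n, lo=1.0, hi=5.0, trials=10_000):
    total = 0.0
    for _ in range(trials):
        a = [random.uniform(lo, hi) for _ in range(n)]
        b = [random.uniform(lo, hi) for _ in range(n)]
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / trials

for n in (1, 4, 16, 64):
    print(n, round(avg_random_distance(n), 2))
# Since E[(X - Y)^2] = (hi - lo)^2 / 6 for independent uniforms, the
# average grows roughly like sqrt(n) * (hi - lo) / sqrt(6) -- that is,
# like sqrt(n), not like n.
```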

Is there a good formula or way of thinking about what that number
should be? I can't find it on the internet.
