Sebastian: I had a look at the distributed Euclidean similarity, and it computes similarity as 1 - 1 / (1 + d). This is the wrong way around, right? Higher distance moves the value toward 1.

For consistency, I'm looking to stick with a 1 / (1 + d) expression for now (unless someone tells me that's just theoretically inferior for sure). I'm thinking of 1 / (1 + d / sqrt(n)) as a better attempt at normalizing away the effect of more dimensions. How does that sound, and shall I make the distributed version behave similarly?

On Wed, Oct 19, 2011 at 3:51 PM, Sean Owen <[email protected]> wrote:
> An interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that it
> 'penalizes' larger vectors, which have more dimensions along which to
> differ. This is bad when those vectors are the subsets of user pref
> data in which two users overlap: more overlap ought to mean higher
> similarity.
>
> I have an ancient, bad kludge in there that uses n / (1 + d), where n
> is the size of the two vectors. It's trying to normalize away the
> average distance between randomly chosen vectors in the space
> (remember that each dimension is bounded, between the min and max
> rating). But that average isn't n.
>
> Is there a good formula, or a way of thinking about what that number
> should be? I can't find it on the internet.
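As a sanity check on the three expressions, here's a quick standalone sketch (plain Java, not Mahout code; the class and method names are just illustrative). The intuition behind the sqrt(n): if each coordinate difference is bounded and roughly independent, E[d^2] grows linearly in n for randomly chosen vectors, so d itself grows like sqrt(n), and dividing by sqrt(n) should keep the similarity of unrelated vectors roughly constant as the overlap grows.

import java.util.Random;

/**
 * Compares the similarity formulas under discussion on random
 * "rating" vectors with coordinates in [1, 5]. Illustrative only;
 * none of these names come from Mahout.
 */
public class EuclideanSimilarityCheck {

  // Plain Euclidean distance between two equal-length vectors.
  static double euclideanDistance(double[] a, double[] b) {
    double sumSq = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sumSq += diff * diff;
    }
    return Math.sqrt(sumSq);
  }

  // What the distributed job does now: tends toward 1 as d grows.
  static double buggySimilarity(double d) {
    return 1.0 - 1.0 / (1.0 + d);
  }

  // Plain 1 / (1 + d): 1 at d = 0, falling toward 0 with distance.
  static double plainSimilarity(double d) {
    return 1.0 / (1.0 + d);
  }

  // Proposed: divide d by sqrt(n) to cancel the sqrt(n) growth of
  // distance with the number of overlapping dimensions.
  static double normalizedSimilarity(double d, int n) {
    return 1.0 / (1.0 + d / Math.sqrt(n));
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    for (int n : new int[] {2, 10, 100, 1000}) {
      double[] a = new double[n];
      double[] b = new double[n];
      for (int i = 0; i < n; i++) {
        a[i] = 1.0 + 4.0 * random.nextDouble();
        b[i] = 1.0 + 4.0 * random.nextDouble();
      }
      double d = euclideanDistance(a, b);
      System.out.printf("n=%4d  d=%8.3f  buggy=%.3f  plain=%.3f  normalized=%.3f%n",
          n, d, buggySimilarity(d), plainSimilarity(d), normalizedSimilarity(d, n));
    }
  }
}

Running this, the "plain" column shrinks toward 0 as n grows and the "buggy" column climbs toward 1, while the "normalized" column stays roughly level, which is what we'd want for vectors that are no more alike than chance.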
