Seems to be the wrong way around indeed. I don't think the normalization can be used in the distributed implementation anymore, as the number of overlapping dimensions is no longer known (this information is lost because we only have the dot product of the vectors and their squared norms at hand). If you find a way to fit it in, we should change it.
--sebastian

2011/10/19 Sean Owen <[email protected]>:
> Sebastian: I had a look at the distributed Euclidean similarity and it
> computes similarity as ...
>
> 1 - 1 / (1+d). This is the wrong way around, right? Higher distance
> moves the value to 1.
>
> For consistency, I'm looking to stick with a 1/(1+d) expression for
> now (unless someone tells me that's just theoretically inferior for
> sure).
>
> I'm thinking of 1 / (1 + d/sqrt(n)) as a better attempt at normalizing
> away the effect of more dimensions.
>
> How's that sound, and shall I make the distributed version behave similarly?
>
> On Wed, Oct 19, 2011 at 3:51 PM, Sean Owen <[email protected]> wrote:
>> Interesting question came up recently about using the Euclidean
>> distance d between two vectors as a notion of their similarity.
>>
>> You can use 1 / (1 + d), which mostly works, except that it
>> 'penalizes' larger vectors, which have more dimensions along which to
>> differ. This is bad when those vectors are the subsets of user pref
>> data in which two users overlap: more overlap ought to mean higher
>> similarity.
>>
>> I have an ancient, bad kludge in there that uses n / (1 + d), where n
>> is the size of the two vectors. It's trying to normalize away the
>> average distance between randomly-chosen vectors in the space
>> (remember that each dimension is bounded, between min and max rating).
>> But that's not n.
>>
>> Is there a good formula or way of thinking about what that number
>> should be? I can't find it on the internet.
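
To make the 1 / (1 + d/sqrt(n)) proposal concrete, here is a minimal sketch in plain Java over the overlapping dimensions of two preference vectors. This is not Mahout's actual implementation; the class and method names and the double[] inputs are just assumptions for the example.

    // Sketch only, not Mahout's implementation. xs and ys hold the two users'
    // ratings over the n dimensions where both users expressed a preference,
    // so the overlap count n is known here.
    final class EuclideanSimilaritySketch {
      static double similarity(double[] xs, double[] ys) {
        int n = xs.length;                   // number of overlapping dimensions
        double sumOfSquaredDiffs = 0.0;
        for (int i = 0; i < n; i++) {
          double diff = xs[i] - ys[i];
          sumOfSquaredDiffs += diff * diff;
        }
        double d = Math.sqrt(sumOfSquaredDiffs);
        // Plain 1 / (1 + d) penalizes pairs with many overlapping dimensions,
        // since every extra dimension can only add to d. Dividing d by sqrt(n)
        // normalizes that growth away, so more overlap no longer drags the
        // similarity toward 0 by itself.
        return 1.0 / (1.0 + d / Math.sqrt(n));
      }
    }

Since each rating is bounded between a min and a max, each dimension contributes at most (max - min)^2 to d^2, so d grows roughly like sqrt(n) for unrelated vectors; dividing d by sqrt(n) is what cancels that growth.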
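
And to make the distributed limitation explicit: the distributed job only carries per-pair aggregates such as the dot product and the vectors' squared norms, which is enough to recover d but says nothing about how many dimensions actually overlap. A hedged sketch of that reconstruction follows; the names are assumptions, not the job's actual API.

    // The distance can be reconstructed from the available aggregates via
    // d^2 = |x|^2 + |y|^2 - 2 * (x . y), but the overlap count n needed for
    // the sqrt(n) normalization is lost at this point.
    final class DistributedDistanceSketch {
      static double distance(double normXSquared, double normYSquared, double dotProduct) {
        double dSquared = normXSquared + normYSquared - 2.0 * dotProduct;
        return Math.sqrt(Math.max(0.0, dSquared));   // guard against rounding below zero
      }
    }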
