I've most often seen something like exp(-d(x,y)) for converting distance to similarity. Unlike 1/(1+d) this decays exponentially in distance, which is usually more desirable. There is a kludge similar to the one you describe, where people use exp(-d/h) for some bandwidth h. I'm not sure there's a standard way of picking h, though. I've seen people use something like a sample variance computed from the data.
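
If it helps, here's roughly how I'd wire that up. This is just a throwaway standalone sketch, not the Mahout API, and the class/method names are mine: similarity = exp(-d/h), with h estimated as the standard deviation of Euclidean distances over a sample of pairs.

public class GaussianKernelSimilarity {

  private final double bandwidth;

  public GaussianKernelSimilarity(double bandwidth) {
    this.bandwidth = bandwidth;
  }

  /** Euclidean distance between two equal-length vectors. */
  static double distance(double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      double diff = x[i] - y[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  /** Similarity in (0, 1]: 1 at distance 0, exponential decay in d. */
  double similarity(double[] x, double[] y) {
    return Math.exp(-distance(x, y) / bandwidth);
  }

  /** One way to pick h: std dev of distances over all sample pairs. */
  static double estimateBandwidth(double[][] samples) {
    double sum = 0.0;
    double sumSq = 0.0;
    int count = 0;
    for (int i = 0; i < samples.length; i++) {
      for (int j = i + 1; j < samples.length; j++) {
        double d = distance(samples[i], samples[j]);
        sum += d;
        sumSq += d * d;
        count++;
      }
    }
    double mean = sum / count;
    double variance = sumSq / count - mean * mean;
    return Math.sqrt(Math.max(variance, 1e-12));  // guard against tiny negatives
  }

  public static void main(String[] args) {
    double[][] ratings = { {5, 3, 4}, {4, 3, 5}, {1, 2, 1} };
    double h = estimateBandwidth(ratings);
    GaussianKernelSimilarity sim = new GaussianKernelSimilarity(h);
    System.out.println("h = " + h);
    System.out.println("sim(0,1) = " + sim.similarity(ratings[0], ratings[1]));
    System.out.println("sim(0,2) = " + sim.similarity(ratings[0], ratings[2]));
  }
}

Whether the standard deviation of pairwise distances is the "right" h is exactly the open question; it just keeps h on the same scale as the data.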
On Oct 19, 2011, at 11:15 AM, Sean Owen wrote:

> There's already a cosine distance measure implementation available;
> this concerns the right-est way to implement a Euclidean
> distance-based measure.
>
> On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp
> <[email protected]> wrote:
>> Do you have a particular reason for not going with cosine?
>>
>> On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:
>>
>>> Interesting question came up recently about using the Euclidean
>>> distance d between two vectors as a notion of their similarity.
>>>
>>> You can use 1 / (1 + d), which mostly works, except that it
>>> 'penalizes' larger vectors, who have more dimensions along which to
>>> differ. This is bad when those vectors are the subsets of user pref
>>> data in which two users overlap: more overlap ought to mean higher
>>> similarity.
>>>
>>> I have an ancient, bad kludge in there that uses n / (1 + d), where n
>>> is the size of the two vectors. It's trying to normalize away the
>>> average distance between randomly-chosen vectors in the space
>>> (remember that each dimension is bounded, between min and max rating).
>>> But that's not n.
>>>
>>> Is there a good formula or way of thinking about what that number
>>> should be? I can't find it on the internet.
>>>
>>
>> --
>> Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
>> http://www.mendeley.com/profiles/christian-prokopp/
>>
>> Mendeley Limited | London, UK | www.mendeley.com
>> Registered in England and Wales | Company number 6419015
>>
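
To make the trade-off in the quoted question concrete, here's a tiny self-contained comparison of 1/(1+d) against the n/(1+d) kludge (again my own throwaway code, not what's in Mahout). With the per-item disagreement held roughly constant, 1/(1+d) scores the pair with the larger overlap lower, while n/(1+d) pushes it higher but no longer stays in [0,1].

public class OverlapSimilaritySketch {

  /** Euclidean distance over the n dimensions both users rated. */
  static double distance(double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      double diff = x[i] - y[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  /** Plain variant: penalizes pairs with more overlapping dimensions. */
  static double plain(double d) {
    return 1.0 / (1.0 + d);
  }

  /** Kludge variant: rewards overlap size n, but can exceed 1. */
  static double kludge(double d, int n) {
    return n / (1.0 + d);
  }

  public static void main(String[] args) {
    // Two users overlapping on 2 items vs. two users overlapping on 6 items,
    // each pair differing by 1 rating point per overlapping item.
    double[] a2 = {4, 5};
    double[] b2 = {5, 4};
    double[] a6 = {4, 5, 3, 2, 5, 1};
    double[] b6 = {5, 4, 4, 3, 4, 2};

    double d2 = distance(a2, b2);
    double d6 = distance(a6, b6);
    System.out.printf("overlap=2: 1/(1+d)=%.3f  n/(1+d)=%.3f%n", plain(d2), kludge(d2, 2));
    System.out.printf("overlap=6: 1/(1+d)=%.3f  n/(1+d)=%.3f%n", plain(d6), kludge(d6, 6));
  }
}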
