I've most often seen something like exp(-d(x,y)) for converting distance to 
similarity.  Unlike 1/(1+d), this has exponential decay in distance, which is 
usually more desirable.  There is a similar kludge to the one you describe, where 
people use exp(-d/h) for some bandwidth h.  I'm not sure there's a standard 
way of picking h, though.  I've seen people use something like a sample variance 
estimated from the data.
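
For concreteness, here's a rough sketch of the conversions (plain Java, nothing
Mahout-specific; the bandwidth heuristic at the end is just one possibility I've
seen, not a standard rule):

  // plain Euclidean distance between two equal-length vectors
  static double euclidean(double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      double diff = x[i] - y[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  // 1 / (1 + d): bounded in (0, 1], but penalizes vectors with more
  // dimensions along which to differ, as described below.
  static double inverseSimilarity(double d) {
    return 1.0 / (1.0 + d);
  }

  // exp(-d / h): exponential decay in distance, with bandwidth h
  // controlling how quickly similarity falls off.
  static double expSimilarity(double d, double h) {
    return Math.exp(-d / h);
  }

  // One heuristic for h: the sample standard deviation of distances
  // observed in the data.  Not a standard rule, just something I've seen.
  static double bandwidthFromSample(double[] observedDistances) {
    double mean = 0.0;
    for (double d : observedDistances) {
      mean += d;
    }
    mean /= observedDistances.length;
    double variance = 0.0;
    for (double d : observedDistances) {
      variance += (d - mean) * (d - mean);
    }
    variance /= observedDistances.length;
    return Math.sqrt(variance);
  }

Usage would be something like expSimilarity(euclidean(x, y), h), with h
estimated once from a sample of pairwise distances.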


On Oct 19, 2011, at 11:15 AM, Sean Owen wrote:

> There's already a cosine distance measure implementation available;
> this concerns the right-est way to implement a Euclidean
> distance-based measure.
> 
> On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp
> <[email protected]> wrote:
>> Do you have a particular reason for not going with cosine?
>> 
>> On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:
>> 
>>> Interesting question came up recently about using the Euclidean
>>> distance d between two vectors as a notion of their similarity.
>>> 
>>> You can use 1 / (1 + d), which mostly works, except that it
>>> 'penalizes' larger vectors, which have more dimensions along which to
>>> differ. This is bad when those vectors are the subsets of user pref
>>> data in which two users overlap: more overlap ought to mean higher
>>> similarity.
>>> 
>>> I have an ancient, bad kludge in there that uses n / (1 + d), where n
>>> is the size of the two vectors. It's trying to normalize away the
>>> average distance between randomly-chosen vectors in the space
>>> (remember that each dimension is bounded, between min and max rating).
>>> But that's not n.
>>> 
>>> Is there a good formula or way of thinking about what that number
>>> should be? I can't find it on the internet.
>>> 
>> 
>> 
>> 
>> --
>> Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
>> http://www.mendeley.com/profiles/christian-prokopp/
>> 
>> Mendeley Limited | London, UK | www.mendeley.com
>> Registered in England and Wales | Company number 6419015
>> 
