Do you have a particular reason for not going with cosine?

On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:

> Interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that it
> 'penalizes' larger vectors, which have more dimensions along which to
> differ. This is bad when those vectors are the subsets of user
> preference data in which two users overlap: more overlap ought to mean
> higher similarity.
>
> I have an ancient, bad kludge in there that uses n / (1 + d), where n
> is the number of dimensions the two vectors share. It's trying to
> normalize away the average distance between randomly chosen vectors in
> the space (remember that each dimension is bounded, between min and max
> rating). But that average distance is not n.
>
> Is there a good formula or way of thinking about what that number
> should be? I can't find it on the internet.
>
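
To make the two formulas in the quoted message concrete, here is a minimal Python sketch. The function names are hypothetical and are not taken from any library, and the test data assumes two users who differ by exactly one rating point on every item they have in common. It is only meant to show the effect being described, not to propose a fix.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_plain(a, b):
    """The 1 / (1 + d) formula from the quoted message."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

def similarity_scaled(a, b):
    """The n / (1 + d) kludge, with n the number of shared dimensions."""
    return len(a) / (1.0 + euclidean_distance(a, b))

# Assumed data: the users disagree by 1 on each overlapping item.
for n in (4, 16):
    a, b = [1.0] * n, [2.0] * n
    d = euclidean_distance(a, b)
    print(n, d, similarity_plain(a, b), similarity_scaled(a, b))
```

With a constant per-item difference, d grows like sqrt(n), so 1 / (1 + d) falls as the overlap grows even though the users agree equally well on every item. The n / (1 + d) variant rises with overlap, but it is no longer bounded by 1.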



-- 
Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
http://www.mendeley.com/profiles/christian-prokopp/

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company number 6419015
