Do you have a particular reason for not going with cosine?

On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:

> Interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that it
> 'penalizes' larger vectors, who have more dimensions along which to
> differ. This is bad when those vectors are the subsets of user pref
> data in which two users overlap: more overlap ought to mean higher
> similarity.
>
> I have an ancient, bad kludge in there that uses n / (1 + d), where n
> is the size of the two vectors. It's trying to normalize away the
> average distance between randomly-chosen vectors in the space
> (remember that each dimension is bounded, between min and max rating).
> But that's not n.
>
> Is there a good formula or way of thinking about what that number
> should be? I can't find it on the internet.
>

--
Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
http://www.mendeley.com/profiles/christian-prokopp/

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company number 6419015
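
For reference, the formulas discussed in this thread can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names and the sample ratings are made up, and the vectors are assumed to hold the ratings of two users on the items they both rated. If two users differ by exactly one rating point on every shared item, 1 / (1 + d) falls from roughly 0.41 with 2 shared items to roughly 0.26 with 8, while n / (1 + d) climbs above 1, which is one way to see that n is not the right normalizer. Cosine is included for comparison.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance d between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_plain(a, b):
    """1 / (1 + d): shrinks as the number of overlapping dimensions grows."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def similarity_scaled(a, b):
    """n / (1 + d), the kludge from the quoted message (n = vector size)."""
    return len(a) / (1.0 + euclidean_distance(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; assumes neither is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical data: the two users differ by one point on every shared item.
small_a, small_b = [3.0, 4.0], [4.0, 5.0]   # 2 shared items
large_a, large_b = [3.0] * 8, [4.0] * 8     # 8 shared items

for name, (a, b) in {"2 items": (small_a, small_b),
                     "8 items": (large_a, large_b)}.items():
    print(name,
          round(similarity_plain(a, b), 3),
          round(similarity_scaled(a, b), 3),
          round(cosine_similarity(a, b), 3))
```

Note that d grows like the square root of n in this example, so dividing by something proportional to sqrt(n) rather than n would keep the score stable here. That is an observation about this toy data only, not an answer to the general question.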
