There's already a cosine similarity measure implementation available; this concerns the right way to implement a Euclidean distance-based measure.
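One way to think about what that normalizer ought to be, under a strong i.i.d. assumption (a sketch of the reasoning, not what the current code does): if each of the n overlapping dimensions is an independent draw, roughly uniform over a rating range of width r = max - min, then E[(x_i - y_i)^2] = r^2 / 6 per dimension, so E[d^2] = n * r^2 / 6 and the distance between two randomly chosen vectors grows like r * sqrt(n / 6), not linearly in n. That would suggest dividing d by something proportional to sqrt(n), e.g. 1 / (1 + d / sqrt(n)), rather than multiplying the whole expression by n.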
On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp <[email protected]> wrote:
> Do you have a particular reason for not going with cosine?
>
> On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:
>
>> Interesting question came up recently about using the Euclidean
>> distance d between two vectors as a notion of their similarity.
>>
>> You can use 1 / (1 + d), which mostly works, except that it
>> 'penalizes' larger vectors, which have more dimensions along which to
>> differ. This is bad when those vectors are the subsets of user pref
>> data in which two users overlap: more overlap ought to mean higher
>> similarity.
>>
>> I have an ancient, bad kludge in there that uses n / (1 + d), where n
>> is the size of the two vectors. It's trying to normalize away the
>> average distance between randomly chosen vectors in the space
>> (remember that each dimension is bounded, between the min and max rating).
>> But that's not n.
>>
>> Is there a good formula or way of thinking about what that number
>> should be? I can't find it on the internet.
>
> --
> Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
> http://www.mendeley.com/profiles/christian-prokopp/
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company number 6419015
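To make the penalty concrete, here is a small standalone sketch (my own illustration, not code from Mahout; the class and helper names are invented) comparing 1 / (1 + d) and n / (1 + d) on two user pairs whose per-item disagreement is identical but whose overlap differs:

import java.util.HashMap;
import java.util.Map;

// Illustration only, not Mahout code: shows how 1/(1+d) penalizes larger
// overlaps and how the n/(1+d) kludge over-corrects.
public class EuclideanSimilaritySketch {

  // Euclidean distance over the items both users rated; returns
  // {distance, overlap size n}.
  static double[] distanceAndOverlap(Map<Long, Double> a, Map<Long, Double> b) {
    double sumSq = 0.0;
    int n = 0;
    for (Map.Entry<Long, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        double diff = e.getValue() - other;
        sumSq += diff * diff;
        n++;
      }
    }
    return new double[] { Math.sqrt(sumSq), n };
  }

  // A user who gives the same rating to items 0..numItems-1.
  static Map<Long, Double> user(int numItems, double rating) {
    Map<Long, Double> prefs = new HashMap<Long, Double>();
    for (long item = 0; item < numItems; item++) {
      prefs.put(item, rating);
    }
    return prefs;
  }

  public static void main(String[] args) {
    // Each pair disagrees by exactly one star on every co-rated item;
    // only the size of the overlap differs.
    for (int n : new int[] { 2, 8 }) {
      double d = distanceAndOverlap(user(n, 3.0), user(n, 4.0))[0];
      System.out.printf("overlap=%d  d=%.3f  1/(1+d)=%.3f  n/(1+d)=%.3f%n",
          n, d, 1.0 / (1.0 + d), n / (1.0 + d));
    }
  }
}

With the same one-star disagreement on every co-rated item, 1 / (1 + d) falls from about 0.41 at n = 2 to about 0.26 at n = 8, while n / (1 + d) climbs past 1, which is exactly why it reads as a kludge rather than a principled normalization.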
