There's already a cosine distance measure implementation available;
this concerns the right way to implement a Euclidean distance-based
measure.
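
For concreteness, here's a minimal sketch, in plain Java, of the two
formulas discussed in the quoted thread below. The class and method
names are hypothetical illustrations, not Mahout's actual API:

    // Sketch of the two candidate Euclidean-distance similarities.
    public final class EuclideanSimilaritySketch {

        // 1 / (1 + d): always in (0, 1], but it penalizes pairs of
        // vectors that overlap in many dimensions, since d tends to
        // grow with the number of dimensions n.
        static double plainSimilarity(double[] x, double[] y) {
            return 1.0 / (1.0 + distance(x, y));
        }

        // n / (1 + d): the kludge. n is the number of dimensions the
        // two vectors share. The result can exceed 1, and n is not
        // the right normalizer; that's the open question below.
        static double kludgedSimilarity(double[] x, double[] y) {
            int n = x.length; // x and y cover the same n dimensions
            return n / (1.0 + distance(x, y));
        }

        // Ordinary Euclidean distance between equal-length vectors.
        static double distance(double[] x, double[] y) {
            double sumOfSquares = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - y[i];
                sumOfSquares += diff * diff;
            }
            return Math.sqrt(sumOfSquares);
        }
    }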

On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp
<[email protected]> wrote:
> Do you have a particular reason for not going with cosine?
>
> On 19 October 2011 15:51, Sean Owen <[email protected]> wrote:
>
>> Interesting question came up recently about using the Euclidean
>> distance d between two vectors as a notion of their similarity.
>>
>> You can use 1 / (1 + d), which mostly works, except that it
>> 'penalizes' larger vectors, which have more dimensions along which
>> to differ. This is bad when those vectors are the subsets of user
>> pref data on which two users overlap: more overlap ought to mean
>> higher similarity.
>>
>> I have an ancient, bad kludge in there that uses n / (1 + d),
>> where n is the number of dimensions over which the two vectors are
>> both defined. It's trying to normalize away the average distance
>> between randomly chosen vectors in the space (remember that each
>> dimension is bounded, between the min and max rating). But the
>> right normalizer isn't n.
>>
>> Is there a good formula or way of thinking about what that number
>> should be? I can't find it on the internet.
>>
>
>
>
> --
> Christian Prokopp | Data Mining Engineer & Marie Curie Fellow
> http://www.mendeley.com/profiles/christian-prokopp/
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company number 6419015
>
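
On the question above of what the normalizer should be: if each
coordinate of two independent random vectors is assumed uniform on
[min, max], then E[(x_i - y_i)^2] = (max - min)^2 / 6 per dimension,
so the expected squared distance is n * (max - min)^2 / 6, and the
typical distance grows like sqrt(n), not n. One plausible choice
(just a sketch under that uniformity assumption, not a claim about
what Mahout does) is to divide d by that expected distance before
applying 1 / (1 + d). Adding such a method to the sketch above:

    // Hypothetical normalization: divide d by the approximate
    // expected distance between two random vectors with coordinates
    // uniform on [minRating, maxRating], which for large n is about
    // sqrt(n / 6) * (maxRating - minRating).
    static double normalizedSimilarity(double[] x, double[] y,
                                       double minRating, double maxRating) {
        int n = x.length;
        double range = maxRating - minRating;
        double expectedDistance = Math.sqrt(n / 6.0) * range;
        double d = distance(x, y); // distance() from the sketch above
        return 1.0 / (1.0 + d / expectedDistance);
    }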
