Agreed. Again, that is why the Pearson implementation has a weighting
option: to de-emphasize similarities computed from small counts.

Do any of the approaches you cite take into account the value of the
rating itself? I agree that there should be some alternative to
Pearson / cosine measure on offer, but right now it's the only
similarity metric that cares about the rating.
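
For reference, here is a minimal sketch of the likelihood-ratio idea for a
2x2 co-occurrence table, written from the standard G^2 formula rather than
taken from any existing implementation. The function names and the meaning
of the counts are assumptions for illustration: k11 = users who rated both
items, k12 = users who rated only the first, k21 = users who rated only
the second, k22 = users who rated neither.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def unnormalized_entropy(*counts):
    # N*log(N) - sum(k*log(k)) over the given counts, where N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    # G^2 statistic for a 2x2 contingency table (hypothetical helper).
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Independent items score near 0; perfectly co-occurring items score high.
    print(log_likelihood_ratio(10, 10, 10, 10))
    print(log_likelihood_ratio(10, 0, 0, 10))
```

Note that the inputs are only counts of who rated what. The rating values
never enter the computation, which is the gap I am asking about.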

On Tue, Jun 23, 2009 at 7:17 PM, Ted Dunning <[email protected]> wrote:
> To beat a very tired horse, I think that all squared error correlation
> measures (Pearson's chi-squared, Pearson's correlation, squared deviation
> and so on) are completely suspect for small count data.  Furthermore, any
> reasonable sample of truly long-tail phenomena includes great numbers of
> small counts.  Furtherfurthermore, long-tail phenomena are the rule rather
> than the exception.
>
> Thus, I almost never like these measures and would have a hard time arguing
> that there is anything good about this kind of measure.  The only exception
> would be in a pub where I would take any side of any debate for the
> amusement of the crowd.
>
> Try mutual information or multinomial likelihood ratios instead.
>
> On Tue, Jun 23, 2009 at 3:48 PM, Sean Owen <[email protected]> wrote:
>
>> One could argue that this behavior is actually a good thing -- basing
>> an estimate of similarity on one data point could be very
>> unreliable.
>>
>
