Agreed. Again, that is why there is a weighting option in the Pearson implementation: to de-emphasize computations based on small counts.
Do any of the approaches you cite take into account the value of the rating itself? I agree, it seems like there should be some alternative to Pearson / cosine measure to offer, but right now it's the only similarity metric that cares about the rating.

On Tue, Jun 23, 2009 at 7:17 PM, Ted Dunning<[email protected]> wrote:
> To beat a very tired horse, I think that all squared error correlation
> measures (Pearson's chi-squared, Pearson's correlation, squared deviation
> and so on) are completely suspect for small count data. Furthermore, any
> reasonable sample of truly long-tail phenomena includes great numbers of
> small counts. Furtherfurthermore, long-tail phenomena are the rule rather
> than the exception.
>
> Thus, I almost never like these measures and would have a hard time arguing
> that there is anything good about this kind of measure. The only exception
> would be in a pub where I would take any side of any debate for the
> amusement of the crowd.
>
> Try mutual information or multinomial likelihood ratios instead.
>
> On Tue, Jun 23, 2009 at 3:48 PM, Sean Owen <[email protected]> wrote:
>
>> One could argue that this behavior is actually a good thing -- basing
>> an estimate of similarity based on one data point could be very
>> unreliable.
>>
>
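
For reference, here is a rough sketch of the likelihood ratio test that Ted suggests in the quoted message, for a 2x2 table of co-occurrence counts. It is only an illustration: the function names and the k11/k12/k21/k22 layout are my own assumptions, not anything in the existing code. Note that it works purely from counts and never looks at the rating values, which is the concern raised above.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy: N log N - sum(k log k), with N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (hypothetical helper).

    k11: both events occurred together
    k12: first event without the second
    k21: second event without the first
    k22: neither event
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))
```

A table whose rows and columns are independent, such as log_likelihood_ratio(10, 10, 10, 10), scores essentially zero, while a perfectly associated one, such as log_likelihood_ratio(10, 0, 0, 10), scores well above zero.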
