Following up on my last e-mail -- yes, I was not crazy: I did implement a basic weighting mechanism like the one I described, in PearsonCorrelationSimilarity. You can select it in the constructor and see what happens.
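For example, something like this (a rough sketch against the Taste API -- package paths and exact signatures are from memory, so double-check against trunk):

  import java.io.File;

  import org.apache.mahout.cf.taste.common.Weighting;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;

  public class WeightedPearsonDemo {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv"));
      // Weighting.WEIGHTED switches on the overlap-based weighting;
      // Weighting.UNWEIGHTED gives plain Pearson correlation.
      PearsonCorrelationSimilarity similarity =
          new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);
      // ... then use it in a neighborhood / recommender as usual
    }
  }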
There is also a LogLikelihoodSimilarity like the one Ted mentions. I have only implemented ItemSimilarity, but a UserSimilarity version could be added with a bit more work -- take a look. What is the co-occurrence approach, Ted?

On Thu, Jun 18, 2009 at 4:43 PM, Ted Dunning<[email protected]> wrote:

> Grant,
>
> The data you described is pretty simple and should produce good results at
> all levels of overlap. That it does not is definitely a problem. In fact,
> I would recommend making the data harder to deal with by giving non-Lincoln
> items highly variable popularities and then making the groundlings rate
> items according to their popularity. This will produce an apparent pattern:
> including any number of non-Lincoln fans will show an apparent preference
> for popular items. The correct inference should, however, be that any
> neighbor group with a large number of Lincoln fans seems to like popular
> items less than expected.
>
> For problems like this, I have had good luck with measures that are robust
> in the face of noise (aka small counts) and in the face of large external
> trends (aka the top-40 problem). The simplest one that I know of is the
> generalized multinomial log-likelihood ratio
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>
> that you hear me nattering about so often. LSI does a decent job of
> dealing with the top-40 problem, but has trouble with small counts. LDA
> and related probabilistic methods should work somewhat better than the
> log-likelihood ratio, but are considerably more complex to implement.
>
> The key here is to compare counts within the local neighborhood to counts
> outside the neighborhood. Things that are significantly different about
> the neighborhood relative to the rest of the world are candidates for
> recommendation. Things to avoid when looking for interesting differences
> include:
>
> - correlation measures such as Pearson's R (based on a normal-distribution
>   approximation and unscaled, so it suffers from both small-count and
>   top-40 problems)
>
> - anomaly measures based simply on frequency ratios (very sensitive to
>   small-count problems, and they don't account for top-40 at all)
>
> What I would recommend for a nearest-neighbor approach is to continue with
> the current neighbor retrieval, but switch to a log-likelihood ratio for
> generating recommendations.
>
> What I would recommend for a production system is to scrap the
> nearest-neighbor approach entirely and go to a co-occurrence-matrix-based
> approach. This costs much less to compute at recommendation time and is
> very robust against both small counts and top-40 issues.
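For concreteness, here is a minimal sketch of the 2x2 (binomial) case of the log-likelihood ratio Ted describes, computed via unnormalized entropies as in his blog post. This is illustrative code, not necessarily what LogLikelihoodSimilarity does internally:

  public final class Llr {

    // k11: users with both items; k12: item A only;
    // k21: item B only; k22: neither item.
    public static double logLikelihoodRatio(long k11, long k12,
                                            long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      // Clamp tiny negatives caused by floating-point rounding.
      return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Unnormalized entropy built from x*ln(x) terms; with it,
    // 2 * (rows + columns - matrix) reduces to Dunning's G^2 statistic.
    private static double entropy(long... counts) {
      long sum = 0;
      double result = 0.0;
      for (long x : counts) {
        result += xLogX(x);
        sum += x;
      }
      return xLogX(sum) - result;
    }

    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }
  }

The score comes out large when two items co-occur much more (or less) often than their individual frequencies predict, which is what makes it robust to both small counts and top-40 effects.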

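And here is my guess at the co-occurrence approach, pending Ted's answer: count, for every pair of items, how many users have both, then score a candidate item for a user by summing its co-occurrence counts with the items the user already has. A toy in-memory sketch (class and method names made up for illustration):

  import java.util.Collection;
  import java.util.HashMap;
  import java.util.Map;

  public class CooccurrenceRecommender {

    // itemA -> (itemB -> number of users who have both)
    private final Map<Long, Map<Long, Integer>> cooccurrence =
        new HashMap<Long, Map<Long, Integer>>();

    // Fold one user's item history into the co-occurrence counts.
    public void addUser(Collection<Long> items) {
      for (Long a : items) {
        for (Long b : items) {
          if (!a.equals(b)) {
            Map<Long, Integer> row = cooccurrence.get(a);
            if (row == null) {
              row = new HashMap<Long, Integer>();
              cooccurrence.put(a, row);
            }
            Integer count = row.get(b);
            row.put(b, count == null ? 1 : count + 1);
          }
        }
      }
    }

    // Score unseen items: effectively the co-occurrence matrix times the
    // user's (binary) item vector.
    public Map<Long, Integer> recommend(Collection<Long> userItems) {
      Map<Long, Integer> scores = new HashMap<Long, Integer>();
      for (Long a : userItems) {
        Map<Long, Integer> row = cooccurrence.get(a);
        if (row == null) {
          continue;
        }
        for (Map.Entry<Long, Integer> e : row.entrySet()) {
          if (!userItems.contains(e.getKey())) {
            Integer sum = scores.get(e.getKey());
            scores.put(e.getKey(), (sum == null ? 0 : sum) + e.getValue());
          }
        }
      }
      return scores; // sort by value, descending, for a top-N list
    }
  }

Raw counts like these still have the top-40 problem, of course; presumably a production version would keep only the pairs whose log-likelihood ratio is significant rather than using the raw counts directly.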