Yes, I agree that keeping all pairs is quite expensive, unless your data set is relatively small (like tens of thousands of items). If you're not running out of memory, OK, you can get away with it for now.
But yes, many of the similarities will not contain much information and don't add much value -- the question is, which ones? For Pearson correlation-based similarity, it's not just a matter of keeping the pairs with the largest and smallest similarity scores -- those nearest 1 or -1. A similarity of 0 could still be very useful information. I think you would actually want to keep an item-item pair based on how many users expressed a preference for both items: the more users, the more important it is to keep that pair.

If you'd like an example of efficiently looking through a large list of things and keeping only the "top n" of them, see the TopItems class (there's a rough sketch of the same idea at the end of this message). You don't want to generate all pairs at once and then throw some away -- that would still run you out of memory.

Ted will say, and again I agree, that Pearson is not usually the best similarity metric, though it is widely mentioned in collaborative filtering examples and literature. What Ted quotes below is implemented in the framework as LogLikelihoodSimilarity. For that, I believe it *is* the pairs with the largest resulting similarity score that you want to keep -- or at least that is more reasonable. Ted, maybe you can check my thinking on that.

Sean

On Mon, Nov 9, 2009 at 7:09 AM, Ted Dunning <[email protected]> wrote:
> Close.
>
> See the link below for one approach to finding the most important ones. I
> believe that Sean has added something like this to Taste/Mahout.
>
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
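
P.S. For illustration, here is a rough sketch of the "keep only the top n" idea -- stream candidate pairs one at a time and hold only the n most important ones in a bounded min-heap, so all pairs never have to exist in memory at once. This is not the actual TopItems code; the ScoredPair class and the scores are made up for the example, and in the framework you'd use TopItems itself.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public final class TopPairsSketch {

  // Illustrative holder for one item-item pair and its "importance" score
  // (e.g. a log-likelihood similarity, or the count of co-rating users).
  // Not a Mahout class; invented for this example.
  static final class ScoredPair implements Comparable<ScoredPair> {
    final long itemA;
    final long itemB;
    final double score;
    ScoredPair(long itemA, long itemB, double score) {
      this.itemA = itemA;
      this.itemB = itemB;
      this.score = score;
    }
    public int compareTo(ScoredPair other) {
      return Double.compare(score, other.score);
    }
  }

  // Retains only the n highest-scoring pairs from a stream of candidates,
  // using a min-heap whose root is always the weakest retained pair.
  static List<ScoredPair> topN(Iterable<ScoredPair> candidates, int n) {
    PriorityQueue<ScoredPair> heap = new PriorityQueue<ScoredPair>(n);
    for (ScoredPair candidate : candidates) {
      if (heap.size() < n) {
        heap.add(candidate);
      } else if (candidate.compareTo(heap.peek()) > 0) {
        heap.poll();        // evict the current weakest pair
        heap.add(candidate);
      }
    }
    return new ArrayList<ScoredPair>(heap);
  }

  public static void main(String[] args) {
    List<ScoredPair> stream = Arrays.asList(
        new ScoredPair(1L, 2L, 0.91),
        new ScoredPair(1L, 3L, 0.10),
        new ScoredPair(2L, 3L, 0.77),
        new ScoredPair(2L, 4L, 0.55));
    for (ScoredPair p : topN(stream, 2)) {
      System.out.println(p.itemA + "," + p.itemB + " -> " + p.score);
    }
  }
}

The same shape works whatever "score" means -- a log-likelihood similarity, or simply the number of users who rated both items -- you just change what goes into the score field.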
