Hi!
I'm experimenting with using the Mahout library's Taste implementation
to provide product recommendations for users as well as identifying
similar items. The data set is past sales - essentially just a boolean
relationship "customer X bought item Y". To get something simple
working - I can optimize and improve later - I just used the file data
model; my file looks like:
438356039,46305
438356039,46339
438386087,56304
<another 1.5 million or so entries here>
I then create a recommender like:
DataModel Model = new FileDataModel(Path);
ItemSimilarity SimilarityForItems = new PearsonCorrelationSimilarity(Model);
ItemBasedRecommender Item = new GenericItemBasedRecommender(Model, SimilarityForItems);
And then do:
List<RecommendedItem> Recommended = Item.mostSimilarItems(ItemID, HowMany);
However, no results are returned. I went digging for why, and wound up
finding that the itemSimilarity method in AbstractSimilarity was
consistently returning NaN. Looking for why, I found that it did indeed
find places where both users expressed a preference for an item; however,
when computing the various centered sums they all came out to zero;
computeResult then always gives back NaN. If I comment out the call to
computeResult and instead replace it with one using the non-centered sums:
//double result = computeResult(count, centeredSumXY, centeredSumX2,
//        centeredSumY2, sumXYdiff2);
double result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
Then I do get results; a similar hack in userSimilarity gives back
results from .recommend too.
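To illustrate what I believe is going on, here is a minimal standalone sketch. It is plain Java, not Mahout code, and the class and method names are made up for illustration. It assumes every preference value is the same constant (1.0), which is my guess at how a file with no preference column ends up being treated. With constant values the centered sums are all zero, so the centered form divides 0 by 0, while the non-centered form does not:

```java
// Illustration only: this is NOT Mahout's implementation.
// Class and method names are hypothetical.
class CenteredSumsSketch {

    // Pearson-style correlation: subtract each series' mean before summing.
    // Assumes x and y have the same, non-zero length.
    static double centered(double[] x, double[] y) {
        double meanX = mean(x);
        double meanY = mean(y);
        double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }
        // With constant input all three sums are 0.0, so this is 0.0 / 0.0.
        return sumXY / Math.sqrt(sumX2 * sumY2);
    }

    // Same formula without centering (i.e. using the raw sums).
    static double nonCentered(double[] x, double[] y) {
        double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        return sumXY / Math.sqrt(sumX2 * sumY2);
    }

    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    public static void main(String[] args) {
        // Assumed: every recorded purchase carries the same value, 1.0.
        double[] a = {1.0, 1.0, 1.0};
        double[] b = {1.0, 1.0, 1.0};
        System.out.println(centered(a, b));    // prints NaN
        System.out.println(nonCentered(a, b)); // prints 1.0
    }
}
```

If that assumption holds, the non-centered value would be 1.0 for every pair with any overlap, which would explain why my hack returns results, though probably not very discriminating ones.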
My guess is that I'm more likely doing something wrong in how I'm
using Mahout than to have stumbled on a bug, and naturally I'd prefer
to use the library "as it comes" rather than a patched version. :-)
However, I'm not sure what I'm doing wrong, and I'm also decidedly not
an expert in this field so I'm not familiar with the details of the
computations being done here. Any thoughts on where I'm going wrong
would be welcomed. If it helps to know, I'm using the latest (0.2) release.
Many thanks for any insight,
Jonathan