Hm, something is off indeed. Tanimoto should be notably faster than a cosine- or correlation-based measure -- it's doing a simple, optimized set intersection and union rather than iterating over a bunch of preference values. And while 5 million data points will consume a fair amount of memory, I wouldn't expect it to exhaust a 1GB heap -- it should be in the hundreds of megs.
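Just to illustrate what I mean (this is a sketch, not the actual Taste code): the Tanimoto coefficient of two users boils down to an intersection and union over the sets of items they're associated with, with no per-item preference values to iterate over.

    // Sketch only, not the real Taste implementation.
    // Needs java.util.Set and java.util.HashSet.
    static <T> double tanimoto(Set<T> itemsA, Set<T> itemsB) {
        Set<T> intersection = new HashSet<T>(itemsA);
        intersection.retainAll(itemsB);               // size of (A intersect B)
        int unionSize = itemsA.size() + itemsB.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }

That's why I'd expect it to beat a cosine or Pearson pass over the actual ratings, not lose to it.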
If you can run only the recommender in the JVM, that obviously frees up memory. I would probably remove the caching wrapper too if memory is at a premium, though that's not the root cause here. If you are running on a 64-bit machine in 64-bit mode, try 32-bit mode (-d32) to reduce the object overhead in the JVM.

From there, you could load the data into a DB instead and use a JDBC-based DataModel, since that doesn't hold everything in memory. You could also try adapting my NetflixDataModel, which reads data organized in directories on disk. But no, something just doesn't seem right; your current setup should be OK. I think I need to try to replicate this with a similarly sized data set and see what's up.
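For the JDBC route, off the top of my head, setting up the model would look something like the following. This is an untested sketch: the table and column names are placeholders, and the exact constructor arguments may differ from what's in your version.

    // Untested sketch. "sales", "acct_id", "item_id" and "preference" are
    // placeholder table/column names -- substitute your own schema.
    // MysqlDataSource is from MySQL Connector/J; any javax.sql.DataSource works.
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");
    dataSource.setDatabaseName("recommender");
    dataSource.setUser("user");
    dataSource.setPassword("password");

    DataModel model = new MySQLJDBCDataModel(
        dataSource, "sales", "acct_id", "item_id", "preference");
    // ... then the same similarity / neighborhood / recommender wiring as in
    // your code below; preferences are read from the table on demand instead
    // of all being held in memory.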
On Thu, Apr 30, 2009 at 5:48 PM, Paul Loy <[email protected]> wrote:
> Hi Sean,
>
> that worked fine. The only issues are:
>
> 1) it's much slower (I guess that's due to Tanimoto being more complex than
> straight cosine)
> 2) I run out of memory with a dataset of 5 million rows, even with 1GB of
> heap space.
>
> (1) doesn't bother me so much. I can run it once a week over a couple of
> days. (2) is a major blocker. I'm guessing that this is because I'm loading
> my entire 5 million rows into memory at once? Is there some way to batch
> process? I'm guessing there is, as it's all Lucene-backed.
>
> My example code now looks like:
>
>     DataModel model = new BooleanPrefUserFileDataModel(new File("sales.txt"));
>
>     BooleanTanimotoCoefficientSimilarity userSimilarity =
>         new BooleanTanimotoCoefficientSimilarity(model);
>     // userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
>
>     UserNeighborhood neighborhood =
>         new NearestNUserNeighborhood(20, userSimilarity, model);
>
>     Recommender recommender =
>         new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
>     Recommender cachingRecommender = new CachingRecommender(recommender);
>
>     List<RecommendedItem> recommendations =
>         cachingRecommender.recommend("1030", 100);
>
>     for (RecommendedItem item : recommendations) {
>         System.out.println(item);
>     }
>
> Any help will be greatly appreciated!
>
> Thanks,
>
> Paul.
>
> On Mon, Apr 27, 2009 at 10:17 PM, Sean Owen <[email protected]> wrote:
>
>> Yeah, the problem here is that all the ratings are '1', and a
>> correlation-based similarity metric like Pearson will return "NaN"
>> for the similarity between all users as a result.
>>
>> You want to take advantage of the situation by using the bits of code
>> that assume you are in this situation, where all the ratings are the
>> same, or 1, or don't matter. Support for this mode is still a bit
>> evolving, but basically you want to:
>>
>> - Use BooleanTanimotoCoefficientSimilarity instead of Pearson.
>> - Omit the ",1" in the data file -- in fact you need to in order to get
>> this to work.
>> - Also, separately, I'd generally discourage people from using
>> PreferenceInferrer unless you know you need or want it; I don't really
>> like the technique. In fact it isn't supported by the similarity
>> implementation above, so just remove that line.
>>
>> If any problems come up, write back; I might have missed a detail there.
>>
>> 2009/4/27 Paul Loy <[email protected]>:
>> > Hi,
>> >
>> > I want to create recommendations for my customers based on boolean data --
>> > essentially, whether they bought a product.
>> >
>> > So this will create a CSV containing:
>> >
>> > acctId, itemId, 1
>> >
>> > There is an entry in the CSV for each sale, so all entries will have a
>> > 'rating' of 1. Using the following example:
>> >
>> >     DataModel model = new FileDataModel(new File("data.txt"));
>> >
>> >     PearsonCorrelationSimilarity userSimilarity =
>> >         new PearsonCorrelationSimilarity(model);
>> >     userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
>> >
>> >     UserNeighborhood neighborhood =
>> >         new NearestNUserNeighborhood(1, userSimilarity, model);
>> >
>> >     Recommender recommender =
>> >         new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
>> >     Recommender cachingRecommender = new CachingRecommender(recommender);
>> >
>> >     List<RecommendedItem> recommendations =
>> >         cachingRecommender.recommend("1967128", 10);
>> >
>> >     for (RecommendedItem item : recommendations) {
>> >         System.out.println(item);
>> >     }
>> >
>> > I get 0 recommendations even when I have seeded the file with obvious
>> > correlations. I'm guessing this is because all 'ratings' are 1. Is there
>> > any way to infer that all other items have a rating of 0, thus giving the
>> > algorithms something to correlate?
>> >
>> > Thanks,
>> >
>> > Paul
>> >
>> > --
>> > ---------------------------------------------
>> > Paul Loy
>> > [email protected]
>> > http://www.keteracel.com/paul
>
> --
> ---------------------------------------------
> Paul Loy
> [email protected]
> http://www.keteracel.com/paul
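P.S. In case it helps to see concretely why Pearson fell over on the all-1s data earlier in the thread: when every rating is identical, each user's variance is zero, so both the numerator and denominator of the correlation come out to zero, and 0.0 / 0.0 is NaN in Java. Not Taste code, just the bare arithmetic:

    // All ratings identical => zero variance => 0.0 / 0.0 => NaN
    double[] a = {1, 1, 1};
    double[] b = {1, 1, 1};
    double meanA = 1.0;
    double meanB = 1.0;
    double num = 0.0;
    double denomA = 0.0;
    double denomB = 0.0;
    for (int i = 0; i < a.length; i++) {
        num    += (a[i] - meanA) * (b[i] - meanB);
        denomA += (a[i] - meanA) * (a[i] - meanA);
        denomB += (b[i] - meanB) * (b[i] - meanB);
    }
    System.out.println(num / Math.sqrt(denomA * denomB)); // prints NaN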
