Thanks! Slope-one is on the map for when I start looking into recommending based on user satisfaction (ratings). At this point I'm focusing on user interest, which limits me to boolean-based algorithms. -qf
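
(As a side note, here is a minimal sketch of what the boolean-preference path looks like with the non-distributed Taste classes -- the prefs.csv path and the user ID are made up, and LogLikelihoodSimilarity is just one reasonable choice of similarity for data without rating values:)

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class BooleanItemRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // prefs.csv: one "userID,itemID" line per interaction, no rating column
        DataModel model = new GenericBooleanPrefDataModel(
            GenericBooleanPrefDataModel.toDataMap(
                new FileDataModel(new File("prefs.csv"))));

        // Log-likelihood similarity ignores preference values, so it suits
        // boolean "interest" data
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);

        // Top 10 recommendations for a (made-up) user ID
        List<RecommendedItem> recs = recommender.recommend(12345L, 10);
        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
      }
    }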
--- On Tue, 5/4/10, Sean Owen <sro...@gmail.com> wrote:

From: Sean Owen <sro...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Tuesday, May 4, 2010, 5:01 PM

On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> Purely based on estimates, assuming 5 billion transactions, 5 million users
> and 100K products, normally distributed, I expect a sparse item-to-item
> matrix of up to 10 million significant co-occurrences (significance is not
> globally defined, but in the context of the active item to recommend from;
> in other words, support can be really tiny, confidence less so).

Sounds like a pretty solid size of data set. I think the recommender will work fine on this -- well, I suppose it depends on your expectations, but this whole piece has been completely revised recently and I feel it's tuned nicely now.

> A few questions:
> - In 0.3 there was also an
>   org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender that I
>   cannot find in the latest trunk. Was this merged into RecommenderJob? Is
>   there any example or unit test for the hadoop.item.RecommenderJob?
> - Is there any more documentation

This has been merged into org.apache.mahout.cf.taste.hadoop.item.* as part of this complete overhaul.

> on hadoop.pseudo? I am still not clear how that is broken into chunks in
> the case of larger models, and how the results are merged afterwards.
> - For clustering: if I want to create a few hundred user clusters, is that
>   doable on a model similar to the one described above, based on boolean
>   preferences?

For this scale, I don't think you can use the pseudo-distributed recommender. It's just too much data to fit into an individual machine's memory. In that case nothing is broken down, since non-distributed algorithms generally use all the data. It's just that one non-distributed recommender is cloned N times so you can crank out recommendations N times faster very easily.

... well, since you don't have all that many items, I could imagine one algorithm working: slope-one. You would need to use Hadoop to compute the item-item diffs ahead of time, and prune them. A pruned set of item-item diffs does fit in memory, so you could go this way. But I think this is the sort of situation that is very well suited to the properly distributed implementation.
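
(Also, for anyone following the thread: a rough sketch of driving the distributed org.apache.mahout.cf.taste.hadoop.item.RecommenderJob that Sean describes, here via Hadoop's ToolRunner. The HDFS paths are placeholders and the option names are my best guess from the current trunk, so check the job's usage output before relying on them:)

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class DistributedRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // Input: one "userID,itemID" line per transaction on HDFS; with
        // boolean data there is no preference-value column.
        // Paths and option spellings below are assumptions, not gospel.
        ToolRunner.run(new RecommenderJob(), new String[] {
            "-Dmapred.input.dir=/user/qf/prefs",
            "-Dmapred.output.dir=/user/qf/recommendations",
            "--booleanData", "true",
            "--numRecommendations", "10",
        });
      }
    }

The same job could equally be launched from the command line with "hadoop jar" against the Mahout job jar and the class name above.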