On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> Purely based on estimates, assuming 5 billion transactions, 5 million users,
> 100K products normally distributed are expected to create a sparse item to
> item matrix of up to 10 million significant co-occurrences (significance is
> not globally defined but in the context of the active item to recommend from;
> in other words support can be really tiny, confidence less so).
Sounds like a pretty solid-sized data set. I think the recommender will work
fine on this -- well, I suppose it depends on your expectations, but this
whole piece has been completely revised recently and I feel it's tuned
nicely now.

> A few questions:
> - In 0.3 there was also a
>   org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender that I
>   cannot find in the latest trunk. Was this merged into RecommenderJob? Is
>   there any example or unit test for the hadoop.item.RecommenderJob?
> - Is there any more documentation

This has been merged into org.apache.mahout.cf.taste.hadoop.item.* as part
of this complete overhaul.

> on hadoop.pseudo? I am still not clear how that is broken into chunks in
> the case of larger models and how the results are being merged afterwards.
> - For clustering - if I want to create a few hundred user clusters - is
>   that doable on a model similar to the one described above, based on
>   boolean preferences?

For this scale, I don't think you can use the pseudo-distributed recommender.
It's just too much data to fit into an individual machine's memory. Nothing
is broken into chunks there, since non-distributed algorithms generally use
all the data; one non-distributed recommender is simply cloned N times so you
can crank out recommendations N times faster very easily.

... well, since you don't have all that many items, I could imagine one
algorithm working: slope-one. You would need to use Hadoop to compute the
item-item diffs ahead of time and prune them, but a pruned set of item-item
diffs fits in memory. You could go this way. But I think this is the sort of
situation very well suited to the properly distributed implementation.
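
In case it helps, invoking the new job looks roughly like this -- a minimal
sketch against the current trunk, run through Hadoop's ToolRunner. The flag
names and paths here are illustrative, so check RecommenderJob's --help
output for the exact options in your revision:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
  public static void main(String[] args) throws Exception {
    // Input is one "userID,itemID[,prefValue]" line per preference on HDFS.
    // Paths are made up for illustration; flag names may differ by revision.
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/user/me/prefs.csv",
        "--output", "/user/me/recommendations",
        "--booleanData", "true",          // preferences are unary (buy/view)
        "--numRecommendations", "10"
    });
  }
}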
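
On the pseudo-distributed point: what gets cloned onto each worker is just an
ordinary in-memory recommender, along these lines (class names from the
current Taste API; TanimotoCoefficientSimilarity is a sensible choice for
boolean preferences). The FileDataModel line is exactly why the whole model
has to fit in one machine's memory:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class InMemoryExample {
  public static void main(String[] args) throws Exception {
    // The entire preference file is loaded into this JVM's memory here.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    Recommender recommender = new GenericItemBasedRecommender(
        model, new TanimotoCoefficientSimilarity(model));
    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}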
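
And to make the slope-one idea concrete, here is an illustrative sketch in
plain Java -- hypothetical code, not the Mahout API -- of the
precompute-and-prune step: accumulate the average item-item rating diffs,
then drop pairs with too few co-raters so what's left fits in memory. On
Hadoop you'd do the accumulation as a map-reduce over users, but the
arithmetic is the same:

import java.util.HashMap;
import java.util.Map;

public class SlopeOneDiffs {
  // key "i:j" (with i < j) -> { sum of (rating_j - rating_i), count }
  private final Map<String, double[]> diffs = new HashMap<>();

  // Call once per user with that user's itemID -> rating map.
  public void addUser(Map<Long, Double> ratings) {
    for (Map.Entry<Long, Double> a : ratings.entrySet()) {
      for (Map.Entry<Long, Double> b : ratings.entrySet()) {
        if (a.getKey() < b.getKey()) {
          double[] d = diffs.computeIfAbsent(
              a.getKey() + ":" + b.getKey(), k -> new double[2]);
          d[0] += b.getValue() - a.getValue();
          d[1]++;
        }
      }
    }
  }

  // Drop pairs seen fewer than minCount times; the pruned table is what
  // would need to fit in memory on each machine.
  public void prune(int minCount) {
    diffs.values().removeIf(d -> d[1] < minCount);
  }

  // Predicted rating for 'item', given the user rated 'other' as 'rating'.
  public double predictFrom(long other, double rating, long item) {
    boolean forward = other < item;
    double[] d = diffs.get(forward ? other + ":" + item : item + ":" + other);
    if (d == null || d[1] == 0) {
      return rating; // no data for this pair; fall back to the known rating
    }
    double avgDiff = d[0] / d[1];
    return forward ? rating + avgDiff : rating - avgDiff;
  }
}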