On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> Purely based on estimates, assuming 5 billion transactions, 5 million users,
> 100K products normally distributed are expected to create a sparse item to
> item matrix of up to 10 million significant co-occurrences (significance is
> not globally defined but in the context of the active item to recommend from;
> in other words support can be really tiny, confidence less so).
Sounds like a pretty solid-sized data set. I think the recommender will work
fine on this -- well, I suppose it depends on your expectations, but this
whole piece has been completely revised recently and I feel it's tuned
nicely now.

> A few questions:
> - In 0.3 there was also a
>   org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender that I
>   cannot find in the latest trunk. Was this merged into RecommenderJob? Is
>   there any example or unit test for the hadoop.item.RecommenderJob?
> - Is there any more documentation

This has been merged into org.apache.mahout.cf.taste.hadoop.item.* as part
of this complete overhaul.

> on hadoop.pseudo? I am still not clear how that is broken into chunks in
> the case of larger models and how the results are being merged afterwards.
> - For clustering - if I want to create a few hundred user clusters - is
>   that doable on a model similar to the one described above, based on
>   boolean preferences?

For this scale, I don't think you can use the pseudo-distributed recommender.
It's just too much data to fit into an individual machine's memory. Nothing
is broken into chunks there, since non-distributed algorithms generally use
all the data; one non-distributed recommender is simply cloned N times so you
can crank out recommendations N times faster very easily.

... well, since you don't have all that many items, I could imagine one
algorithm working: slope-one. You would need to use Hadoop to compute the
item-item diffs ahead of time and prune them, but a pruned set of item-item
diffs fits in memory. You could go this way. But I think this is the sort of
situation very well suited to the properly distributed implementation.
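
In case it helps, invoking the new job looks roughly like this -- a minimal
sketch against the current trunk, run through Hadoop's ToolRunner. The flag
names and paths here are illustrative, so check RecommenderJob's --help
output for the exact options in your revision:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
  public static void main(String[] args) throws Exception {
    // Input is one "userID,itemID[,prefValue]" line per preference on HDFS.
    // Paths are made up for illustration; flag names may differ by revision.
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/user/me/prefs.csv",
        "--output", "/user/me/recommendations",
        "--booleanData", "true",          // preferences are unary (buy/view)
        "--numRecommendations", "10"
    });
  }
}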
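
On the pseudo-distributed point: what gets cloned onto each worker is just an
ordinary in-memory recommender, along these lines (class names from the
current Taste API; TanimotoCoefficientSimilarity is a sensible choice for
boolean preferences). The FileDataModel line is exactly why the whole model
has to fit in one machine's memory:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class InMemoryExample {
  public static void main(String[] args) throws Exception {
    // The entire preference file is loaded into this JVM's memory here.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    Recommender recommender = new GenericItemBasedRecommender(
        model, new TanimotoCoefficientSimilarity(model));
    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}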
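
And to make the slope-one idea concrete, here is an illustrative sketch in
plain Java -- hypothetical code, not the Mahout API -- of the
precompute-and-prune step: accumulate the average item-item rating diffs,
then drop pairs with too few co-raters so what's left fits in memory. On
Hadoop you'd do the accumulation as a map-reduce over users, but the
arithmetic is the same:

import java.util.HashMap;
import java.util.Map;

public class SlopeOneDiffs {
  // key "i:j" (with i < j) -> { sum of (rating_j - rating_i), count }
  private final Map<String, double[]> diffs = new HashMap<>();

  // Call once per user with that user's itemID -> rating map.
  public void addUser(Map<Long, Double> ratings) {
    for (Map.Entry<Long, Double> a : ratings.entrySet()) {
      for (Map.Entry<Long, Double> b : ratings.entrySet()) {
        if (a.getKey() < b.getKey()) {
          double[] d = diffs.computeIfAbsent(
              a.getKey() + ":" + b.getKey(), k -> new double[2]);
          d[0] += b.getValue() - a.getValue();
          d[1]++;
        }
      }
    }
  }

  // Drop pairs seen fewer than minCount times; the pruned table is what
  // would need to fit in memory on each machine.
  public void prune(int minCount) {
    diffs.values().removeIf(d -> d[1] < minCount);
  }

  // Predicted rating for 'item', given the user rated 'other' as 'rating'.
  public double predictFrom(long other, double rating, long item) {
    boolean forward = other < item;
    double[] d = diffs.get(forward ? other + ":" + item : item + ":" + other);
    if (d == null || d[1] == 0) {
      return rating; // no data for this pair; fall back to the known rating
    }
    double avgDiff = d[0] / d[1];
    return forward ? rating + avgDiff : rating - avgDiff;
  }
}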