Thanks! 
Slope-one is on the map for when I start to look into recommending based on 
user satisfaction (ratings). At this point I'm focusing on user interest, which 
limits me to boolean-based algorithms.
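
For reference, the non-distributed setup I have in mind looks roughly like 
this (just a sketch; data.csv with userID,itemID lines is a placeholder):

// Sketch of a boolean-preference recommender in Taste.
// data.csv is a placeholder: lines of "userID,itemID" with no rating column,
// which FileDataModel treats as boolean preferences.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanPrefExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("data.csv"));
    UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(50, similarity, model);
    Recommender recommender =
        new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    // 10 recommendations for a sample user ID
    List<RecommendedItem> recs = recommender.recommend(12345L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}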
-qf

--- On Tue, 5/4/10, Sean Owen <sro...@gmail.com> wrote:

From: Sean Owen <sro...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Tuesday, May 4, 2010, 5:01 PM

On Tue, May 4, 2010 at 9:53 PM, First Qaxy <qa...@yahoo.ca> wrote:
> Purely based on estimates: assuming 5 billion transactions, 5 million users, 
> and 100K products, normally distributed, I expect a sparse item-to-item 
> matrix of up to 10 million significant co-occurrences (significance is not 
> globally defined but relative to the active item to recommend from; in other 
> words, support can be really tiny, confidence less so).

Sounds like a pretty solid-sized data set. I think the recommender
will work fine on this -- well, I suppose it depends on your
expectations, but this whole piece has been completely revised recently
and I feel that it's tuned nicely now.


> A few questions:
> - In 0.3 there was also an org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender 
> that I cannot find in the latest trunk. Was this merged into RecommenderJob?
> - Is there any example or unit test for the hadoop.item.RecommenderJob?
> - Is there any more documentation

This has been merged into org.apache.mahout.cf.taste.hadoop.item.* as
part of that complete overhaul.
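
For example, I'd expect to kick it off roughly like this (the flag names are 
assumptions based on AbstractJob conventions and are still moving on trunk, so 
check the job's usage output; the HDFS paths are placeholders):

// Sketch: launching the distributed item-based RecommenderJob from code.
// Paths and flags below are placeholders/assumptions -- verify them against
// the usage output of the trunk revision you're on.
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/user/qf/prefs",   // lines of userID,itemID[,pref]
        "--output", "/user/qf/recs",
        "--booleanData", "true"        // assumption: boolean-preference switch
    });
  }
}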

>  on hadoop.pseudo? I am still not clear how that is broken into chunks in 
> the case of larger models and how the results are merged afterwards.
> - For clustering: if I want to create a few hundred user clusters, is that 
> doable on a model similar to the one described above, based on boolean 
> preferences?

For this scale, I don't think you can use the pseudo-distributed
recommender. It's just too much data to fit into an individual machine's
memory. Nothing is broken into chunks there, since non-distributed
algorithms generally use all of the data. It's just that one
non-distributed recommender is cloned N times so you can crank out
recommendations N times faster very easily.
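
If it helps, here's the idea in code -- purely an illustration, not the
actual hadoop.pseudo source:

// Conceptual sketch of the pseudo-distributed approach: every worker holds a
// full copy of the data and handles only its own slice of the user IDs, which
// is why the whole model still has to fit in one machine's memory.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PseudoDistributedSketch {

  /** workerIndex/numWorkers stand in for one of the N cloned recommenders. */
  static void runWorker(DataModel model, Recommender recommender,
                        int workerIndex, int numWorkers) throws TasteException {
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      // The union of all workers' slices covers every user; that's the only
      // sense in which the work is split -- the data itself is not.
      if (userID % numWorkers == workerIndex) {
        System.out.println(userID + " -> " + recommender.recommend(userID, 10));
      }
    }
  }
}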

... well, since you don't have all that many items, I could imagine one
algorithm working: slope-one. You would need to use Hadoop to compute
the item-item diffs ahead of time and prune them, but a pruned set of
item-item diffs fits in memory. You could go this way.
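
Roughly, the in-memory side of that would look like the sketch below --
the diff file path and the cap on entries are placeholders, and it assumes
rating values rather than boolean data, since slope-one needs preference
values:

// Sketch: slope-one over a pruned, precomputed item-item diff file (e.g. the
// output of the Hadoop slope-one diffs job, post-processed into one file).
import java.io.File;

import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.file.FileDiffStorage;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.recommender.slopeone.DiffStorage;

public class PrecomputedSlopeOneSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Pruned diffs computed ahead of time; 10M is just an illustrative cap
    // on how many diff entries to hold in memory.
    DiffStorage diffs = new FileDiffStorage(new File("pruned-diffs.csv"), 10000000L);
    Recommender recommender =
        new SlopeOneRecommender(model, Weighting.UNWEIGHTED, Weighting.UNWEIGHTED, diffs);
    System.out.println(recommender.recommend(12345L, 10));
  }
}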

But I think this is the sort of situation very well suited to the
properly distributed implementation.

