Purely as an estimate: 5 billion transactions over 5 million users and
100K roughly normally distributed products should produce a sparse
item-to-item matrix with up to 10 million significant co-occurrences
(significance is not defined globally but in the context of the active
item being recommended from; in other words, support can be really
tiny, confidence less so).
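
A quick back-of-the-envelope check of those numbers (my own arithmetic,
just a sketch, nothing measured):

// Plugging the figures above into a tiny estimate: average transactions
// per user and the density the 10M significant pairs would imply.
public class CooccurrenceEstimate {
  public static void main(String[] args) {
    double transactions = 5e9;
    double users = 5e6;
    double items = 1e5;
    double significantPairs = 1e7;

    double txPerUser = transactions / users;            // ~1,000 transactions per user
    double possiblePairs = items * (items - 1) / 2.0;   // ~5e9 distinct item-item pairs
    double density = significantPairs / possiblePairs;  // ~0.002, i.e. roughly 0.2% of all pairs

    System.out.printf("tx/user=%.0f, possible pairs=%.2e, density=%.4f%n",
        txPerUser, possiblePairs, density);
  }
}

So the matrix stays very sparse even at the high end of the estimate.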

Given this data set I expect clustering and most of the classification
algorithms to perform well thanks to the underlying Hadoop support. I also
assume that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob would work
similarly on such a model.
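
For concreteness, this is roughly how I picture driving it (a sketch only;
the option names are assumptions on my part and may differ between 0.3 and
the current trunk):

// Assumed invocation of the distributed item-based job; the same options
// could also be passed on the command line via `hadoop jar`.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommenderJob {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "/data/prefs",            // userID,itemID[,pref] text files on HDFS
        "--output", "/data/recommendations",
        "--booleanData", "true",             // treat presence of a (user,item) pair as the preference
        "--numRecommendations", "10"
    });
  }
}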
A few questions:

- In 0.3 there was also an
org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender that I
cannot find in the latest trunk. Was this merged into RecommenderJob? Is
there any example or unit test for hadoop.item.RecommenderJob?
- Is there any more documentation on hadoop.pseudo? I am still not clear how
the model is broken into chunks in the case of larger models and how the
results are merged afterwards.
- For clustering: if I want to create a few hundred user clusters, is that
doable on a model similar to the one described above, based on boolean
preferences? (See the sketch after this list.)
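
To make the clustering question concrete, here is a minimal sketch, assuming
user histories are already mapped to contiguous item indexes; the data layout
(one sparse vector per user, 1.0 for every item the user touched) is my own
assumption of what I would feed to the Hadoop k-means job:

import java.io.IOException;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class UserVectorWriter {

  /** itemIndexesByUser maps a user id to the item indexes (0..numItems-1) that user touched. */
  public static void write(Map<Long, Set<Integer>> itemIndexesByUser, int numItems, Path out)
      throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      for (Map.Entry<Long, Set<Integer>> entry : itemIndexesByUser.entrySet()) {
        Vector userVector = new RandomAccessSparseVector(numItems);
        for (int itemIndex : entry.getValue()) {
          userVector.set(itemIndex, 1.0);   // boolean preference: present = 1.0
        }
        writer.append(new Text(entry.getKey().toString()), new VectorWritable(userVector));
      }
    } finally {
      writer.close();
    }
  }
}

The resulting sequence file of VectorWritables is what I would point the
k-means driver at, with k set to a few hundred.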
Many thanks.

-qf
--- On Mon, 4/26/10, Sean Owen <sro...@gmail.com> wrote:

From: Sean Owen <sro...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Monday, April 26, 2010, 3:39 AM

Yes, I think you'd have to be more specific to get anything but
general answers --

The non-distributed algorithms scale best, if by "scale" you're
referring to CPU/memory required per unit of output. But they hit a
point where they can't run anymore because you'd need a single
machine so large that it's impractical.

Every algorithm has different needs as its input grows, and its needs
even differ depending on the nature of its input (e.g. number of
users versus number of items, not just total ratings, for
recommenders). So there's no single answer to how much is needed
per unit of output.

The distributed versions don't have this limit, so if by "scale" you
mean the upper limit on the size of input that can be processed, there
isn't one. They generally require more CPU/memory per unit of output
due to the overhead of distributing, but can then scale indefinitely.

