Purely based on estimates: assuming 5 billion transactions, 5 million users, and 100K products (roughly normally distributed), I expect a sparse item-to-item matrix of up to 10 million significant co-occurrences. Significance here is not defined globally but in the context of the active item to recommend from; in other words, support can be really tiny, while confidence less so.
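To make concrete what I mean by significance being relative to the active item, here is a toy calculation; the counts and the 0.05 cut-off are made up purely for illustration and are not Mahout settings:

// Toy illustration only: made-up counts showing why global support can be
// tiny while confidence w.r.t. the active item is still meaningful.
public class SignificanceToy {
  public static void main(String[] args) {
    long totalUsers = 5000000L;      // 5M users overall
    long activeItemCount = 500L;     // users who took the active item A
    long pairCount = 40L;            // users who took both A and candidate item B

    double support = (double) pairCount / totalUsers;         // ~0.000008, really tiny
    double confidence = (double) pairCount / activeItemCount; // 0.08

    // hypothetical per-active-item cut-off, not a Mahout setting
    boolean significant = confidence >= 0.05;
    System.out.printf("support=%.8f confidence=%.3f significant=%b%n",
        support, confidence, significant);
  }
}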
Given this data set I expect clustering and most of the classification to perform well, based on the underlying Hadoop support. I also assume that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob would work similarly on such a model. A few questions:

- In 0.3 there was also an org.apache.mahout.cf.taste.hadoop.cooccurence.UserItemRecommender that I cannot find in the latest trunk. Was this merged into RecommenderJob?
- Is there any example or unit test for hadoop.item.RecommenderJob? (A rough sketch of how I would expect to drive it is at the end of this mail, below the quoted reply.)
- Is there any more documentation on hadoop.pseudo? I am still not clear how the model is broken into chunks in the case of larger models and how the results are merged afterwards.
- For clustering: if I want to create a few hundred user clusters, is that doable on a model similar to the one described above, based on boolean preferences?

Many thanks.

-qf

--- On Mon, 4/26/10, Sean Owen <sro...@gmail.com> wrote:

From: Sean Owen <sro...@gmail.com>
Subject: Re: Algorithm scalability
To: mahout-user@lucene.apache.org
Received: Monday, April 26, 2010, 3:39 AM

Yes, I think you'd have to be more specific to get anything but general answers.

The non-distributed algorithms scale best, if by "scale" you're referring to the CPU/memory required per unit of output. But they hit a point where they can't run anymore, because you'd need a single machine so large that it's impractical. Every algorithm has different needs as its input grows, and its needs even differ depending on the nature of its input (e.g. number of users versus number of items, not just total ratings, for recommenders), so there's not a single answer to how much is needed per unit of output.

The distributed versions don't have this limit, so if by "scale" you mean the upper limit on the size of input that can be processed, there isn't one. They generally require more CPU/memory per unit of output due to the overhead of distributing, but then can scale infinitely.
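P.S. Here is the rough, untested sketch I mentioned above of how I would expect to drive hadoop.item.RecommenderJob, in lieu of finding a unit test. The option names (--input, --output, --booleanData, --numRecommendations), the assumption that the job can be launched through Hadoop's ToolRunner, and the HDFS paths are all my guesses from reading trunk; please correct them if they are off.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

// Untested driver sketch for the distributed item-based recommender.
// Input is assumed to be text lines of userID,itemID (no rating column),
// matching the boolean-preference model described above.
public class RecommenderJobDriver {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/user/qf/prefs",            // placeholder HDFS path
        "--output", "/user/qf/recommendations", // placeholder HDFS path
        "--booleanData", "true",                // my guess at the flag for ratingless data
        "--numRecommendations", "10"
    });
  }
}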