Ah yeah, I thought it might be a spam filter issue. Yeah, unfortunately the current code uses APIs only available in Hadoop 0.20. You could run Hadoop 0.20 locally, upgrade your cluster (it can still run older-style jobs, I believe), or else roll back the code in the package that touches Hadoop by one revision. That previous revision should work on 0.18.3, I believe.
Hundreds of millions of users is big indeed. Sounds like you have way more users than items. This tells me that any user-based algorithm is probably out of the question: the model certainly can't be loaded into memory on one machine. We could work on ways to compute all pairs of user-user similarities in a distributed way, but that's trillions of similarities, even after filtering out some unnecessary work.

Item-based recommenders are more realistic. It would still take a long time to compute item-item similarities given the number of users you have, but at least you're only computing thousands to millions of such similarities. Grant is right -- perhaps you can use approaches unrelated to the preference data to compute item-item similarity.

Given a fixed set of item-item similarities, it is fast to compute recommendations for any one user, and it doesn't require loading the model into memory. Hence, you could then use the pseudo-distributed Hadoop framework I've pointed out to spread these per-user computations across many machines. You can test this locally for sure: one machine can produce recommendations just fine, given a fixed set of item-item similarities and an item-based recommender. Heck, you don't even need Hadoop to see how well this works. I would try seeing how well the recommendations work first, before figuring out Hadoop. (There's a rough code sketch below the quoted message.)

There are also slope-one algorithms. I think they give good results, and they would behave much like item-based recommenders in this case. Slope-one requires precomputing a large matrix data structure (there is a separate Hadoop job to do that in a distributed way), but it's also pretty fast at runtime. At your scale, precomputing that data structure is going to require Hadoop, so I would try this next, after item-based.

On Fri, Jul 24, 2009 at 12:09 AM, Aurora Skarra-Gallagher<[email protected]> wrote:
> Hi,
>
> Thank you for responding. My Spam filter was "out to get me" and your
> responses were misclassified.
>
> I will investigate the Hadoop integration piece, specifically RecommenderJob.
> Currently, the Hadoop grid I'm working with is using 0.18.3. Will that pose a
> problem? I noticed some threads about versions of Hadoop less than 0.19 not
> working.
>
> We are looking at starting with 70M users and scaling up to 500M eventually.
> It is hard for me to estimate the number of items. We could be starting out
> with 100, but as these items are entities that we extract, there could be
> tens of thousands eventually. I would guess that most users would have less
> than 100 of these.
>
> Does that help? I would be interested in your input on the algorithms and
> also being a guinea pig for the code you're developing, if it makes sense.
>
> -Aurora
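P.S. To make the item-based idea concrete, here is roughly what it looks like with the Taste API in Mahout. Treat it as a sketch only: the file name, item IDs, and similarity values are placeholders, and the exact class names may differ a bit depending on which Mahout revision you have (older ones say "Correlation" where newer ones say "Similarity").

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemBasedSketch {

  public static void main(String[] args) throws Exception {
    // Preference data, one "userID,itemID,value" per line ("prefs.csv" is a placeholder).
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // A fixed, precomputed set of item-item similarities -- these could come from
    // item metadata rather than from the preference data itself.
    List<GenericItemSimilarity.ItemItemSimilarity> sims =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.9));
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.4));
    ItemSimilarity similarity = new GenericItemSimilarity(sims);

    // Once the similarities are fixed, recommendations for any one user are cheap,
    // which is what makes it feasible to farm users out across machines later.
    Recommender recommender = new GenericItemBasedRecommender(model, similarity);
    for (RecommendedItem item : recommender.recommend(1L, 10)) {
      System.out.println(item);
    }
  }
}

The slope-one variant is just as easy to try in memory on a sample of your data -- something like new SlopeOneRecommender(model), if that class is in the revision you have -- but at 70M+ users its diff structure is exactly what would need the distributed precomputation mentioned above.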
