Off-line user-based analysis is quite feasible, however. We worked with data larger than this at Veoh and could crunch it down to usable form in 10 hours on a 20-core micro-cluster.
The key step is computing sparse co-occurrences and filtering for interesting non-zero values.

On Fri, Jul 24, 2009 at 2:11 AM, Sean Owen <[email protected]> wrote:
> Hundreds of millions of users is big indeed. Sounds like you have way
> more users than items. This tells me that any user-based algorithm is
> probably out of the question. The model certainly can't be loaded into
> memory on one machine. We could work on ways to compute all pairs of
> similarities in a distributed way, but that's trillions of
> similarities, even after filtering out some unnecessary work.

--
Ted Dunning, CTO DeepDyve
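To make the co-occurrence step concrete, here is a minimal sketch of the idea (not the actual Veoh pipeline): build a sparse user-item matrix A, take A^T A to get item-item co-occurrence counts, and keep only off-diagonal entries above a threshold. The matrix contents and the `min_count` cutoff are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item interactions: (user, item) pairs, rows = users, cols = items
users = np.array([0, 0, 1, 1, 2, 2, 2])
items = np.array([0, 1, 0, 2, 0, 1, 2])
data = np.ones(len(users))
A = csr_matrix((data, (users, items)), shape=(3, 3))

# Item-item co-occurrence counts: entry (i, j) = number of users
# who interacted with both item i and item j
cooc = (A.T @ A).tocoo()

# Filter: drop the diagonal and keep only "interesting" counts
min_count = 2  # arbitrary threshold for this sketch
keep = [(int(i), int(j), int(v))
        for i, j, v in zip(cooc.row, cooc.col, cooc.data)
        if i != j and v >= min_count]
print(sorted(keep))
```

Because A^T A only produces non-zero entries where two items actually share a user, the work is proportional to the sparse structure of the data rather than to the full item-pair space, which is what makes the offline computation tractable.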
