On Sat, Dec 5, 2009 at 5:17 PM, Sean Owen <[email protected]> wrote:

> I suggest for purposes of the project we would build implementations
> of Recommender that can consume some output from Hadoop on HDFS, like
> a SequenceFile or whatever it's called. Shouldn't be hard at all. This
> sort of hybrid approach is already what happens with slope-one -- I
> wrote some jobs to build its diffs and then you can load the output
> into SlopeOneRecommender -- which works online from there.
>

Something generic like this would be helpful, I think, as well as
outputting to a Lucene index.

> I wonder how important these implementations are for the project,
> which seems like a bit of heresy -- surely Mahout needs to support
> recommendation on huge amounts of data? I think the answer's yes, but:
>
> LinkedIn and Netflix and Apple and most organizations with huge data
> to recommend from have already developed sophisticated, customized
> solutions.
>

Actually, from direct experience and conversations with principals
involved, I can tell you that you would be surprised at the
unsophistication of some parts of the production systems at all three
of these places (as well as, e.g., Amazon). Mahout could end up becoming
large parts of some of their infrastructure for doing this at some point.

> Organizations with less than 100M data points or so to process don't
> need distributed architectures. They can use Mahout as-is with its
> online non-distributed recommenders pretty well. 10 lines of code and
> one big server and a day of tinkering and they have a full-on simple
> recommender engine, online or offline. And I argue that this is about
> 90% of users of the project who want recommendations.
>

Today, yes. In a year, this number will maybe be 80%. In 2 years, maybe
60%. Big data is coming to smaller and smaller organizations. Usage data
doesn't need to be only internal: stuff you mine off of the web can be
used too...

> So who are these organizations that have enough data (like 1B+ data
> points) that they need something like the rocket science that LinkedIn
> needs, but can't or haven't developed such capability already
> in-house?
>
> I guess that's why I've been reluctant to engineer and complicate the
> framework to fit in offline distributed recommendation -- because this
> can become as complex as we like -- since I wonder at the 'market' for
> it.
> But it seems inevitable that this must exist, even if just as a
> nice clean simple reference implementation of the idea. Perhaps I
> won't go overboard on designing something complex yet here at the
> moment.
>

All of the above aside: I agree that over-engineering something to do
this is not desirable. But thinking about how we output partially
processed "matrices" for online recommendation generation is something
we should still do. Maybe we're adequately served by spitting out
SequenceFiles, with a simple API for zipping through them and producing
scores using pluggable scoring functions?

  -jake
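
P.S. To make that last suggestion a bit more concrete, here is a rough
sketch in plain Java of what "zip through the rows and score them with a
pluggable function" might look like. Everything in it (ItemRow,
ScoringFunction, ScoringFunctions, ScoredItem, RowScanner) is a made-up
name for illustration, not existing Mahout or Hadoop API. The Iterator
stands in for whatever would actually read rows back out of a
SequenceFile, and the "user vector" is just an assumed dense array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// All names below are hypothetical; none of this is Mahout or Hadoop API.

/** One precomputed row as an offline job might emit it: item ID plus a dense vector. */
final class ItemRow {
  final long itemID;
  final double[] features;
  ItemRow(long itemID, double[] features) {
    this.itemID = itemID;
    this.features = features;
  }
}

/** The pluggable part: scores one item row against a user's vector. */
interface ScoringFunction {
  double score(double[] userVector, double[] itemFeatures);
}

/** Two example scoring functions; both assume vectors of equal length. */
final class ScoringFunctions {
  static final ScoringFunction DOT = (u, v) -> {
    double sum = 0.0;
    for (int i = 0; i < u.length; i++) {
      sum += u[i] * v[i];
    }
    return sum;
  };
  static final ScoringFunction COSINE = (u, v) -> {
    double norms = Math.sqrt(DOT.score(u, u) * DOT.score(v, v));
    return norms == 0.0 ? 0.0 : DOT.score(u, v) / norms;
  };
}

final class ScoredItem {
  final long itemID;
  final double score;
  ScoredItem(long itemID, double score) {
    this.itemID = itemID;
    this.score = score;
  }
}

final class RowScanner {
  /** Single pass over the rows, keeping only the n best-scoring items in memory. */
  static List<ScoredItem> topN(Iterator<ItemRow> rows, double[] userVector,
                               ScoringFunction fn, int n) {
    // Min-heap on score: the weakest of the current top n sits at the head.
    PriorityQueue<ScoredItem> heap =
        new PriorityQueue<>(Comparator.comparingDouble((ScoredItem s) -> s.score));
    while (rows.hasNext()) {
      ItemRow row = rows.next();
      heap.add(new ScoredItem(row.itemID, fn.score(userVector, row.features)));
      if (heap.size() > n) {
        heap.poll(); // drop the lowest score seen so far
      }
    }
    List<ScoredItem> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
    return result;
  }
}

final class ScoringSketch {
  public static void main(String[] args) {
    // In-memory stand-in for rows read from the offline job's output.
    List<ItemRow> rows = Arrays.asList(
        new ItemRow(1L, new double[] {1, 0}),
        new ItemRow(2L, new double[] {0, 1}),
        new ItemRow(3L, new double[] {1, 1}));
    double[] user = {2, 1};
    for (ScoredItem s : RowScanner.topN(rows.iterator(), user, ScoringFunctions.DOT, 2)) {
      System.out.println(s.itemID + " " + s.score);
    }
  }
}
```

Swapping ScoringFunctions.DOT for ScoringFunctions.COSINE (or anything
else implementing the interface) changes the ranking logic without
touching the scan. The heap keeps memory at O(n) however many rows the
offline job produced, which seems like the property we'd want if the
rows are streamed off HDFS rather than loaded whole.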
