On Sat, Dec 5, 2009 at 4:27 PM, Sean Owen <[email protected]> wrote:
> The biggest pain for me here is how to rationalize all of this into an
> API. The current code is completely online. Now I'm dropping in a
> truly offline/distributed version, which is a totally different
> ballgame. And then there are all these hybrid approaches, computing
> some stuff offline and some online and requiring real-time integration
> with HDFS.
This is something I've been thinking about a lot too - at LinkedIn we do a ton of offline Hadoop-based computation for recommendations, but then a bunch of stuff is (or can be) done online. You can do it with Lucene, as Ted suggests (and in fact that is one of our implementations), or by precomputing a lot of results and storing them in a key-value store (in our case, Voldemort). But having a nice API for *outputting* the precomputed matrices - which are pretty big - into a format where online "queries"/recommendation requests can be served is, I think, really key here. We should think much more about what makes the most sense.

  -jake
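To make the offline/online split concrete, here is a minimal, hypothetical sketch of the precompute-and-serve pattern described above: an offline job (Hadoop, in LinkedIn's case) writes each user's top-N item IDs into a key-value store, and the online "query" is then just a key lookup. A plain HashMap stands in for Voldemort here; the class and method names are illustrative, not Mahout's actual API.

```java
import java.util.*;

// Sketch of serving precomputed recommendations from a key-value store.
// The HashMap is a stand-in for a real store like Voldemort; in practice
// the offline Hadoop job would bulk-load its output into that store.
public class PrecomputedRecommender {

  private final Map<Long, long[]> store = new HashMap<>();

  // Offline side: persist the top-N item IDs computed for a user.
  public void put(long userID, long[] topNItemIDs) {
    store.put(userID, topNItemIDs);
  }

  // Online side: a recommendation request is a single key lookup,
  // with no model computation at request time.
  public long[] recommend(long userID) {
    long[] items = store.get(userID);
    return items != null ? items : new long[0];
  }

  public static void main(String[] args) {
    PrecomputedRecommender rec = new PrecomputedRecommender();
    // Simulated output of the offline job for user 42.
    rec.put(42L, new long[] {7L, 3L, 11L});
    System.out.println(Arrays.toString(rec.recommend(42L)));  // [7, 3, 11]
    System.out.println(rec.recommend(99L).length);            // 0 (unknown user)
  }
}
```

The trade-off this sketch makes explicit: request latency is constant and tiny, but freshness is bounded by how often the offline job reruns, which is exactly the hybrid tension Sean describes.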
