Taking your comments out of order.

On Tue, Sep 8, 2009 at 2:41 PM, Sean Owen <[email protected]> wrote:
> And yeah my world view is not centered around big off-line
> computations. In that world, yes I bet this is a good way to compute
> all recommendations at once. I still maintain that interesting use
> cases involve roughly real-time updates and recommendations.

This can be described, using the notation I have been using, just by moving parentheses. The big off-line version of recommendation is (A' A) h while the more on-line version is A' (A h). The overall shape of the computation is the same, but the first form allows the majority of the computation to be done off-line.

My focus in the past has always been to produce recs really fast (sometimes hundreds per second). As is generally the case, doing as much computation as possible ahead of time is a great way to get fast response.

> I think reducing it to a matrix multiplication problem is tidy, and
> works I am sure.

This approach has been good for me for the last decade of building rec engines.

> I don't know if, practically, it affords the tweaks
> and hooks and modifications that let some of these stock algos
> actually produce good results. It can probably be made to work too.

As I tried to say in my original comment, I am only describing the shape or pattern of the computation. There are lots of tweaks that can be added without changing this shape. Also, there are lots of ways of implementing the general pattern. One of my favorites is to use a text retrieval engine for the matrix-vector multiplication part and a map-reduce engine for the matrix-matrix part.

I tend to view requirements for lots of hooks and modifications as an indication that the underlying algorithms are not phrased well. That isn't always true, but it often is. Methods that depend on normal distribution assumptions often have pathological performance for small counts. This leads to lots of "if (n > 5)" sorts of hacks to avoid problematic cases. These methods include most chi-squared techniques, correlation-based recommendation engines and LSA. The right answer is to avoid the original mistake and use log-likelihood ratio tests, probabilistic cooccurrence measures and LDA respectively.

As another example, in text retrieval, stop words are often used because results for many retrieval engines are sooo bad without them. Really, though, this indicates a failure of the scoring algorithms, the phrase recognition algorithms and possibly the index format. All of these can be fixed. Whether it is worth fixing is another question, but the inference from the need for stop lists that the fundamentals are somewhat broken is still valid.

> My gut is that it is not the most efficient way to do it -- if only
> because it definitely is not efficient to compute this to produce
> *one* set of recs, or, to answer many of the questions one asks of a
> recommender, which is not just to produce recs.

Efficiency is a tricky thing here. Because the off-line and on-line forms produce the same computation, and because the majority of the work is in the off-line part, it is a definite win for response time to do the off-line work. But there is much more to it than that. The off-line form of the computation can be done using very clean, very fast sequential accesses to the data. The on-line form requires lots of random accesses to data. That difference makes the computation whalingly faster to do off-line. In my experience "whalingly" can be three orders of magnitude, which is a huge factor.
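To make the two parenthesizations concrete, here is a tiny numpy sketch; the matrix, the history vector and the sizes are made up purely for illustration:

    import numpy as np

    # Toy interaction matrix A: rows are users, columns are items.
    # A[u, i] = 1 means user u interacted with item i.
    A = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

    # h is the item-history vector of the user we are recommending for.
    h = np.array([1, 0, 1, 0], dtype=float)

    # Off-line form: precompute the item-item cooccurrence matrix A'A once;
    # each recommendation is then a single matrix-vector product.
    cooccurrence = A.T @ A            # the batch (matrix-matrix) part
    recs_offline = cooccurrence @ h   # the cheap per-request part

    # On-line form: identical arithmetic, different parenthesization,
    # but it has to touch the raw interaction data at request time.
    recs_online = A.T @ (A @ h)

    assert np.allclose(recs_offline, recs_online)
    print(recs_offline)

In a real system the A'A step is the map-reduce (matrix-matrix) part and the final matrix-vector product can be served out of something like a text retrieval engine, as described above.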
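On the small-count point, here is a minimal sketch of the log-likelihood ratio test on a 2x2 cooccurrence table, written in the entropy form; the counts are invented for illustration:

    import math

    def llr_2x2(k11, k12, k21, k22):
        """Log-likelihood ratio (G^2) for a 2x2 table of counts.

        k11 = times items A and B were seen together
        k12 = times A was seen without B
        k21 = times B was seen without A
        k22 = times neither was seen
        """
        def xlogx(x):
            return 0.0 if x == 0 else x * math.log(x)

        n = k11 + k12 + k21 + k22
        # G^2 = 2 * sum k_ij * ln(k_ij / expected_ij), rewritten so that
        # zero cells and tiny expected counts cause no trouble.
        return 2.0 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                      - xlogx(k11 + k12) - xlogx(k21 + k22)
                      - xlogx(k11 + k21) - xlogx(k12 + k22)
                      + xlogx(n))

    # A pair seen together only 4 times still gets a meaningful score here,
    # even though the expected count for that cell is far below 5, which is
    # exactly where chi-squared needs an "if (n > 5)" style of guard.
    print(llr_2x2(4, 40, 30, 10000))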
Even more important, however, is that there are lots of streamlinings that can be done if you have global information, and these can save many more orders of magnitude in speed. For instance, suppose that you have some item that 50 million people have interacted with. Do you think that you will have any noticeable accuracy loss if you consider only 1 million of those interactions? Or even if you only consider a few thousand of them? The correct answer is that you can randomly downsample to about a thousand interactions with no perceptible difference in accuracy.

This has a huge impact because the cost of computing the cooccurrence counts between two items goes up with the product of the prevalence of each item. If you decrease each by a factor of a thousand, you have decreased the necessary computation by a factor of a million. Unfortunately, the overall prevalence is a global property that is much easier to deal with off-line. Similarly, sparsification by statistical means is trivial off-line and difficult on-line. This can, again, save you orders of magnitude in cost at final recommendation time.

As a concrete example of just how massive these speedups are, I worked on one system that had about 100-200 million interactions per day with about 10 million items. We used 7 months of history to generate recommendations, which comes to about 7 x 30 x 100-200 M = 20-40 billion observations. Using the techniques I have described, we were able to build a system that served well over 100 recommendations per second. The off-line computation could be done relatively easily in about 10 hours and would take less with more recent versions of Pig and Hadoop. Total hardware included a tiny Hadoop cluster with 10-20 cores and about 40 GB of RAM for the off-line portion and a few real-time servers. With more modern machines, this entire operation could have been done with 2-4 1U servers. The recommendation quality was excellent, and the system picked up changes in user behavior essentially as fast as they occurred.

That may not have been the *most* efficient way to solve the problem, but it was efficient enough to be hard to beat.
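Here is a rough sketch of the random down-sampling mentioned above; the cap of 1,000 interactions per item and the (user, item) log format are illustrative assumptions, not a description of the actual system:

    import random
    from collections import defaultdict

    def downsample(interactions, max_per_item=1000, seed=42):
        """Randomly keep at most max_per_item (user, item) pairs per item.

        Reservoir sampling means a single sequential pass over the
        interaction log is enough, which suits the off-line style of
        processing described above.
        """
        rng = random.Random(seed)
        kept = defaultdict(list)   # item -> sampled users
        seen = defaultdict(int)    # item -> interactions seen so far
        for user, item in interactions:
            seen[item] += 1
            if len(kept[item]) < max_per_item:
                kept[item].append(user)
            else:
                # Keep the new pair with probability max_per_item / seen.
                j = rng.randrange(seen[item])
                if j < max_per_item:
                    kept[item][j] = user
        return kept

    # 50,000 interactions with one popular item collapse to 1,000 samples.
    log = [("u%d" % (i % 5000), "popular_item") for i in range(50000)]
    print(len(downsample(log)["popular_item"]))   # -> 1000

A statistical sparsification pass (for instance, keeping only cooccurrence cells with a high LLR score) fits naturally into the same off-line job.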
--
Ted Dunning, CTO DeepDyve