On Mon, Jul 18, 2011 at 8:52 PM, Grant Ingersoll <[email protected]> wrote:
> We definitely cannot do this all map side, but I think the big difference is
> the paper emits the partial sum of all the co-occurrences it's seen, whereas
> we emit all the little co-occurrences. The big difference is between one
> write of what amounts to a few identifying pieces and a double per outer loop
> iteration and one write per inner loop iteration of Cooccurrence, as I
> understand it. I think that makes for a huge difference in performance.
If you mean the paper uses a combiner after the mapper, it may well do. That's not quite possible in RowSimilarityJob, since it emits more than a count, and that can't be 'combined'. And yes, that has a price. If you knew you only wanted counts and co-occurrences, this pipeline could be faster and slimmer. I don't think the mapper can do any 'combining' on its own before emitting, though, if that's what you're suggesting: within one map you are touching one column, and nothing in that map's output by itself can be summed before writing, because every co-occurrence belongs to a different item-item pair.

> Feature reduction is fine and all, but that isn't the problem here and it has
> nothing to do with what I am talking about. I'm just saying that if
> RowSimilarityJob is saying it is implementing what is in the paper, then we
> aren't doing it right. As I said before, I'm doing this on a smaller
> matrix than what is in the paper and on much better hardware and the
> performance is 100x (my guess) worse. It claims 2 hour running time, I've
> yet to see this finish in 24-48 hours.

I'm not talking about feature reduction either, and neither is the paper; we're talking about simple data pruning. You're looking for "what's different", and this is most definitely a difference: there is no comparable pruning in RowSimilarityJob. On top of that, RowSimilarityJob does something more general than what this paper describes -- it's not just computing cosine similarity -- and that slows it down too.

> I think the big win is to not construe this implementation to be based on
> what is in the paper. I'm starting to think we should have two RowSimilarity
> jobs. One for algebraic functions and one for those who are not.

I really do think it is the same idea and the same shape of algorithm, and it certainly is based on this paper. (It's a pretty simple idea anyway; the full distributed recommender pipeline does the same thing and predates this paper. It *does* have some features like limiting row/column size; you might go dig there too if interested.) You might agree more when you dig into the details.

There's a tradeoff between generality and performance, and it's an acute one here. I think it's right for the one implementation here to work correctly and be flexible at the expense of performance. For what it does, it does it about as efficiently as I can imagine. If you fix some more assumptions -- fix your metric, for example -- you can certainly make something faster. But it's not just two jobs that emerge from building special-case pipelines -- it'll be tens! Still, there is probably a good way to weave in more hooks and flags here so that it can take better advantage of assumptions such as "I only need co-occurrence counts and so can use a combiner".

To be specific: I have a different pipeline out here, based on the Mahout pipeline, which is heavily optimized for a very specific metric and type of input. It simply processes a lot less data by specializing for exactly what's needed and knowing what can be thrown out. Its similarity phase seems to run faster than even what's in the paper (though I'm mentally extrapolating from different results and may not be that accurate). That is, the delta in code between this and a fast, optimized, specific pipeline is not big, though the resulting difference in running time can be big! I can tell you from experience that it's the counting of the co-occurrences that utterly dominates, unless you prune and use a combiner, and even then it's not quick.
The calculation of similarity is trivial in comparison; I don't think that part will yield much speedup. Dig in and start messing with the code -- you'll probably come to agree with some of this, but you'll also identify a lot of improvements!
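
To make the combiner/pruning point concrete, here's a rough, untested sketch of the count-only special case in plain Hadoop. None of this is existing Mahout code; the class names, the tab-separated "user<TAB>item,item,..." input format, and the 500-item cap are all made up for illustration. The point is just that a plain count combines trivially, which is exactly the assumption RowSimilarityJob can't make in general:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class CooccurrenceCountSketch {

  /** Emits ((itemA,itemB), 1) for every item pair in one user's row. */
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int MAX_PREFS_PER_USER = 500; // pruning cap; value is arbitrary
    private static final LongWritable ONE = new LongWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] items = line.toString().split("\t")[1].split(",");
      // The "simple data pruning": cap pathologically long rows before the
      // quadratic pair loop, since they contribute most of the emitted pairs.
      if (items.length > MAX_PREFS_PER_USER) {
        items = Arrays.copyOf(items, MAX_PREFS_PER_USER);
      }
      Arrays.sort(items); // canonical order so (A,B) and (B,A) collapse to one key
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          pair.set(items[i] + ',' + items[j]);
          ctx.write(pair, ONE);
        }
      }
    }
  }

  /**
   * Counts are summable, so the same class can serve as both combiner and
   * reducer (job.setCombinerClass(SumReducer.class) plus setReducerClass(...)).
   */
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text pair, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable count : counts) {
        sum += count.get();
      }
      ctx.write(pair, new LongWritable(sum));
    }
  }
}

The combiner only pays off to the extent that the same item pair shows up many times in one map task's output, but for counts it costs nothing to turn on. RowSimilarityJob can't do the same in general, since, as above, its map output is more than a count and can't simply be summed.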
