On Jul 18, 2011, at 4:32 PM, Sean Owen wrote:

> On Mon, Jul 18, 2011 at 8:52 PM, Grant Ingersoll <[email protected]> wrote:
>> We definitely cannot do this all map side, but I think the big difference is
>> the paper emits the partial sum of all the co-occurrences it's seen, whereas
>> we emit all the little co-occurrences. The big difference is between one
>> write of what amounts to a few identifying pieces and a double per outer
>> loop iteration and one write per inner loop iteration of Cooccurrence, as I
>> understand it. I think that makes for a huge difference in performance.
>
> If you mean the paper uses a combiner after the mapper, it may well
> do. That's not quite possible in RowSimilarityJob since it emits more
> than a count, which can't be 'combined'. And yes that has a price. If
> you knew you only wanted counts and cooccurrences this pipeline could
> be faster and slimmer.
If we only supported commutative functions, then it could be.

> I don't think the mapper can do any 'combining' on its own though
> before emitting, if that's what you're suggesting. Within one map you
> are touching one column; there is nothing from the map's output by
> itself that can be summed before writing, because every co-occurrence
> is from a different item-item pair.

In the paper it does do some map-side computation. See Figure 3 in the paper by Jimmy Lin (not the one in the javadocs, but the one I sent the other day). It's the partial sum of the weights.

>> Feature reduction is fine and all, but that isn't the problem here and it
>> has nothing to do with what I am talking about. I'm just saying that if
>> RowSimilarityJob is saying it is implementing what is in the paper, then we
>> aren't doing it right. As I said before, I'm doing this on a smaller
>> matrix than what is in the paper and on much better hardware and the
>> performance is 100x (my guess) worse. It claims a 2-hour running time; I've
>> yet to see this finish in 24-48 hours.
>
> I'm not talking about feature reduction either, nor is the paper;
> we're talking about simple data pruning. You're seeking "what's
> different" and this is most definitely a difference. There is no
> comparable pruning in RowSimilarityJob.

I already did significant pruning before even doing the run. They threw out the top 1% of terms of a corpus that is 2x mine before running their job. I've thrown away the top 20% of terms and only kept terms whose doc freq is at least 15, all on a much smaller corpus to start with. Then I do the RowSimilarityJob. Theirs completed in 2 hours; mine doesn't complete in 48 hours using what is in Mahout. That's a sparse matrix of size 500K x 150K, which certainly isn't that big. I don't think RowSimJob, as is, scales. Sure, it's general purpose, but what good does that do when it doesn't complete?

> That, and the fact that
> RowSimilarityJob does something more general than what this paper
> suggests; it's not just computing cosine measure similarity. And that
> makes it slow down too.

Agreed. Supporting non-algebraic similarity measures is what forces the inner-loop writes of co-occurrences and a whole lot more shuffling and sorting before the reduce phase.

>> I think the big win is to not construe this implementation to be based on
>> what is in the paper. I'm starting to think we should have two
>> RowSimilarity jobs: one for algebraic functions and one for those that are
>> not.
>
> I do really think it is the same idea and same shape of algorithm, and
> it certainly is based on this paper. (It's a pretty simple idea
> anyway; the full distributed recommender pipeline does the same thing
> and predates this paper. It *does* have some features like limiting
> row/column size; you might go dig there too if interested.) You might
> agree more when you dig into the details.

Most of it is similar. Outputting the co-occurrences and doing the sum on the reduce side is the big difference from what I've seen.
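To make that concrete, the map-side partial summing looks roughly like this. It's only a sketch, not RowSimilarityJob code: the input format (one sparse row per line as "index:weight" tokens), the class name, and the string pair keys are all made up. Instead of one write per inner-loop co-occurrence, the mapper accumulates per-pair partial sums and flushes them once in cleanup(), which is only legal because the value being aggregated is a plain sum (commutative and associative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of in-mapper combining of partial co-occurrence sums.
// Hypothetical input: each value is one sparse row, e.g. "3:0.7 17:1.2 42:0.4".
public class PartialSumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  // pair key "i,j" -> running sum of w_i * w_j over all rows this mapper has seen
  private final Map<String, Double> partialSums = new HashMap<>();

  @Override
  protected void map(LongWritable offset, Text row, Context ctx) {
    String[] tokens = row.toString().trim().split("\\s+");
    for (int a = 0; a < tokens.length; a++) {
      String[] ta = tokens[a].split(":");
      for (int b = a + 1; b < tokens.length; b++) {
        String[] tb = tokens[b].split(":");
        double product = Double.parseDouble(ta[1]) * Double.parseDouble(tb[1]);
        partialSums.merge(ta[0] + "," + tb[0], product, Double::sum);
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // One write per distinct pair seen by this mapper, not one per inner-loop iteration.
    for (Map.Entry<String, Double> e : partialSums.entrySet()) {
      ctx.write(new Text(e.getKey()), new DoubleWritable(e.getValue()));
    }
  }
}

The mapper-side memory is the tradeoff; for very dense rows you would have to flush the map in chunks rather than only in cleanup().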
> There's a tradeoff between generality and performance, and an acute
> one here. I think it's right for the one implementation here to work
> correctly and be flexible at the expense of performance. For what it
> does, it does it about as efficiently as I can imagine.

Yes, but I think RowSimilarityJob is trying to do too much by supporting all kinds of distance measures, many of which won't complete in any useful amount of time on anything of significance and are therefore pointless. And here I am measuring "useful amount of time" as less than a day (and that's on a cluster of size 20 w/ lots of memory per node).

> If you fix some more assumptions -- fix your metric for example -- you
> can certainly make something faster. It's not just two jobs that
> emerge from trying to build some special-case pipelines -- it'll be
> tens! Still there is probably a good way to weave in more hooks and
> flags here so that it can take better advantage of assumptions such as
> "I only need cooccurrence counts and so can use a combiner".
>
> To be specific: I have a different pipeline out here, based on the
> Mahout pipeline, which is heavily optimized for a very specific metric
> and type of input. It just needs to process a lot less data by
> specializing for exactly what's needed and knowing what can be thrown
> out. Its similarity phase seems to run faster than even what's in the
> paper (though I'm mentally extrapolating different results and may not
> be that accurate.)
>
> That is: I am saying that the delta between this and a fast/optimized
> specific pipeline is not big though the resulting difference in time
> can be big!
>
> I can tell you from experience that it's the counting of the
> cooccurrence that utterly dominates unless you prune and use a
> combiner and even then it's not quick. The calculation of similarity
> is trivial in comparison; I don't think that will yield much speedup.

I think the difference between outputting every co-occurrence and the partial sum is quite significant.

> Dig in and start messing with the code, you'll probably come to agree
> with some of this, but also identify a lot of improvements!

What do you think I've been doing? I just don't think having a generic version is all that useful for anything but smallish matrices, in which case you probably don't need Mahout to begin with. I think we should pare down the reach of RowSimilarityJob to something that actually is tractable.

-Grant
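P.S. For what it's worth, the combiner route mentioned above would look roughly like this for plain sums. Again only a sketch with made-up names; the same class can serve as both combiner and reducer only because addition is commutative and associative:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Merges partial sums for the same item pair; safe as a combiner because
// it is a pure, order-independent sum.
public class PairSumCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private final DoubleWritable result = new DoubleWritable();

  @Override
  protected void reduce(Text pair, Iterable<DoubleWritable> partials, Context ctx)
      throws IOException, InterruptedException {
    double sum = 0.0;
    for (DoubleWritable w : partials) {
      sum += w.get();
    }
    result.set(sum);
    ctx.write(pair, result);
  }
}

Wiring it in is just job.setCombinerClass(PairSumCombiner.class) alongside the reducer; map outputs for the same pair then get merged before the shuffle, which is where the bulk of the I/O goes.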
