On Jul 18, 2011, at 5:41 PM, Sean Owen wrote:

> On Mon, Jul 18, 2011 at 10:15 PM, Grant Ingersoll <[email protected]> wrote:
>> If we only supported commutative functions, then it could be.
>
> No, it makes no difference. Again: it's that the output of this phase
> in RowSimilarityJob *is not a count*. It can't be summed.
>
>>> I don't think the mapper can do any 'combining' on its own though
>>> before emitting, if that's what you're suggesting. Within one map you
>>> are touching one column; there is nothing from the map's output by
>>> itself that can be summed before writing, because every co-occurrence
>>> is from a different item-item pair.
>>
>> In the paper it does some map-side computation. See figure 3 in the paper
>> by Jimmy Lin (not the one in the javadocs, but the one I sent the other
>> day). It's the partial sum of the weights.
>
> This is not the same procedure that (I thought) we were talking about.
>
> Figure 3 is just the creation of the inverted index -- the columns, or
> items, in my terminology. It is not counting anything.
> Figure 5 does that. You can see that it does actually emit all pairs.
> It does not show a combiner, but it sure could use one.
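To make the step we're debating concrete, the all-pairs emit in that phase looks roughly like the mapper below. This is only a sketch, not the actual Mahout code or the paper's code: the class name is made up and the input is assumed to be (item id, space-separated ids of the rows that item occurs in).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits one (rowA,rowB) pair with a count of 1 for every
// co-occurrence found in a single column (item).
public class AllPairsCountMapper extends Mapper<Text, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override
  protected void map(Text item, Text rowIds, Context ctx)
      throws IOException, InterruptedException {
    // Assumed input record: item id -> space-separated ids of the rows
    // (e.g. docs or users) that the item occurs in.
    String[] rows = rowIds.toString().split(" ");
    for (int i = 0; i < rows.length; i++) {
      for (int j = i + 1; j < rows.length; j++) {
        // The inner-loop emit: O(n^2) tiny records per column go to the shuffle.
        pair.set(rows[i] + "," + rows[j]);
        ctx.write(pair, ONE);
      }
    }
  }
}

A plain summing Reducer could be set as the combiner on that job, which I take to be the "it sure could use one" remark, though as I say next, the combiner isn't really my point.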
Yes, sorry. Figure 5. I don't know what the Combiner has to do with this. That's an added bonus. I'm talking about the emit step (line 7). We are doing the emit step inside the inner loop, outputting every little co-occurrence in the name of supporting a whole host of similarity measures; they are doing it outside the inner loop and theoretically only support algebraic functions (and have implemented one). I _UNDERSTAND_ why we are doing this, I just don't think it is worthwhile to be so generic.

> (well, what it does is actually exactly what Dhruv just mentioned.
> It's all pairs, but output as a series of N arrays, not NxN individual
> tuples.)
>
>> I already did significant pruning before even doing the run. They threw out
>> the top 1% of terms of a corpus that is 2x mine before running their job.
>> I've thrown away the top 20% of terms as well as only taken terms where the
>> doc freq is at least 15, all on a much smaller corpus to start. Then I do
>> the RowSimilarityJob. Theirs completed in 2 hours. Mine doesn't complete
>> in 48 hours using what is in Mahout. That's a sparse matrix of size 500K x
>> 150K. That certainly isn't that big. I don't think RowSimilarityJob, as is,
>> scales. Sure, it's general purpose, but what good does that do when it
>> doesn't complete?
>
> Maybe Sebastian can comment. Sounds like you're doing it all right,
> but that's quite slow. As he's said, it's used in production at
> reasonably big places like Foursquare. It does "something" right --
> but clearly it's falling down in this case for some reason.
>
>> Agreed. The supporting of non-algebraic similarity measures is the thing
>> that forces the inner-loop writes of co-occurrences and forces a whole lot
>> more shuffling and sorting before the reduce phase.
>
> (Nope, we're still talking about two things here. The "counting" is
> what I am referring to above and is the slow bit. The similarity
> computation later is pretty trivial in comparison. Trust me.)

I don't think so, and it isn't what I am seeing. I think the fact that we are outputting a ton of little things (co-occurrences) is _WAY_ more expensive than outputting partial weights at the outer loop (see the sketch further down).

>> Most of it is similar. The outputting of the co-occurrences and doing the
>> sum on the reduction side is the big difference from what I've seen.
>
> See above -- not a difference, no. I think there's still confusion
> about which mapper is doing this work; it's figure 5.

Yes, it is figure 5. My bad on the numbering.

>> What do you think I've been doing? I just don't think having a generic
>> version is all that useful for anything but smallish matrices, in which case
>> you probably don't need Mahout to begin with.
>>
>> I think we should pare down the reach of RowSimilarityJob to something that
>> actually is tractable.
>
> I agree that there are opportunities to speed this up if you add more
> hooks and flags -- though I think I would not agree with your
> reasoning so far about where those bits are! Clearly this stuff works
> fine for some people, and not for your data, which doesn't lead to the
> conclusion that this job is useless.

I just think it is trying to be way too generic.

> It sure cries out for more information. For example, which phase is
> the slow one? How sparse are the rows? I think these could
> confirm/deny theories and either point out solutions, or confirm that
> it is just going to need some changes to perform better.

They are very sparse vectors (it's general text).
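For comparison, the "partial weights at the outer loop" / "N arrays instead of NxN tuples" alternative would look roughly like the sketch below. Again, the names are made up and the same (item id, space-separated row ids) input is assumed; the catch is that it only works for measures whose partial sums can be merged, i.e. the algebraic case.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: "stripes" variant -- one associative array per row instead of
// one record per pair, so partial sums can be merged before the shuffle.
public class StripesCooccurrenceSketch {

  public static class StripesMapper extends Mapper<Text, Text, Text, MapWritable> {
    @Override
    protected void map(Text item, Text rowIds, Context ctx)
        throws IOException, InterruptedException {
      String[] rows = rowIds.toString().split(" ");
      for (int i = 0; i < rows.length; i++) {
        // Outer-loop emit: one stripe per row id holding its partial weights
        // (plain counts here; a weighted measure would put w_i * w_j instead).
        MapWritable stripe = new MapWritable();
        for (int j = 0; j < rows.length; j++) {
          if (i != j) {
            stripe.put(new Text(rows[j]), new DoubleWritable(1.0));
          }
        }
        ctx.write(new Text(rows[i]), stripe);
      }
    }
  }

  // Works as both combiner and reducer: merges stripes by summing entries,
  // which is exactly the "algebraic" assumption being discussed.
  public static class StripeSumReducer
      extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text row, Iterable<MapWritable> stripes, Context ctx)
        throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          Text other = new Text((Text) e.getKey());  // copy: Hadoop reuses objects
          DoubleWritable prev = (DoubleWritable) merged.get(other);
          double sum = (prev == null ? 0.0 : prev.get())
              + ((DoubleWritable) e.getValue()).get();
          merged.put(other, new DoubleWritable(sum));
        }
      }
      ctx.write(row, merged);
    }
  }
}

The win isn't the raw number of co-occurrence entries (that's the same order) but the number of records hitting the sort -- n per column instead of roughly n^2/2 -- plus the fact that a combiner can collapse stripes for the same row across many columns before the shuffle.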
Both the co-occurrence phase and the similarity reduce phase take a long time. I can try to get more details. I will also try to see if I can run it on the ASF mail archives.

--------------------------
Grant Ingersoll
