On Mon, Jul 18, 2011 at 10:15 PM, Grant Ingersoll <[email protected]> wrote:
> If we only supported commutative functions, then it could be.

No, it makes no difference. Again: the point is that the output of this
phase in RowSimilarityJob *is not a count*. It can't be summed.
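
To make the distinction concrete: a combiner is only legal when the map
output is a partial result under an associative, commutative operation,
as in plain counting. A minimal sketch of that happy case (generic
Hadoop, hypothetical class name, not Mahout's actual code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Usable as both combiner and reducer: partial sums of counts can be
// re-summed in any order, any number of times, without changing the
// final result.
public class CountSumCombiner
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable sum = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> partialCounts,
      Context context) throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable c : partialCounts) {
      total += c.get();
    }
    sum.set(total);
    context.write(key, sum);
  }
}

The phase we're discussing emits values that aren't partial sums of
anything, so no combiner of this shape applies, commutative or not.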

>
>>
>> I don't think the mapper can do any 'combining' on its own before
>> emitting, though, if that's what you're suggesting. Within one map you
>> are touching one column; there is nothing in the map's output by
>> itself that can be summed before writing, because every co-occurrence
>> is for a different item-item pair.
>
> In the paper it does some map-side computation.  See figure 3 in the paper by 
> Jimmy Lin (not the one in the javadocs, but the one I sent the other day).  
> It's the partial sum of the weights.

This is not the same procedure that (I thought) we were talking about.

Figure 3 is just the creation of the inverted index -- the columns, or
items, in my terminology. It is not counting anything.
Figure 5 does that. You can see that it does actually emit all pairs.
It does not show a combiner, but it sure could use one.

(Well, what it does is actually exactly what Dhruv just mentioned:
it's all pairs, but output as a series of N arrays, not N x N
individual tuples.)
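
That N-arrays-instead-of-NxN-tuples idea is the classic "pairs" versus
"stripes" trade-off from Lin's book. A rough stripes-style map, with
invented names and input types (this is not RowSimilarityJob's code),
assuming each input value is one row's list of items:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// "Stripes": for each item in a row, emit one map of co-occurring
// items -> counts instead of one (itemA, itemB) tuple per
// co-occurrence. N stripes shuffle far fewer records than N*N pairs,
// and stripes for the same item can be merged by a combiner via
// element-wise addition.
public class CooccurrenceStripesMapper
    extends Mapper<Text, Text, Text, MapWritable> {

  @Override
  protected void map(Text rowId, Text itemList, Context context)
      throws IOException, InterruptedException {
    String[] items = itemList.toString().split("\\s+");
    for (String a : items) {
      MapWritable stripe = new MapWritable();
      for (String b : items) {
        if (!a.equals(b)) {
          Text other = new Text(b);
          IntWritable count = (IntWritable) stripe.get(other);
          stripe.put(other, new IntWritable(count == null ? 1 : count.get() + 1));
        }
      }
      context.write(new Text(a), stripe);
    }
  }
}

The catch, per the point above, is that the stripe values have to be
summable for the combiner half of this to be legal.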


> I already did significant pruning before even doing the run. They threw out 
> the top 1% of terms of a corpus that is 2x mine before running their job.  
> I've thrown away the top 20% of terms as well as only taken terms where the 
> doc freq is at least 15, all on a much smaller corpus to start.   Then I do 
> the RowSimilarityJob.  Theirs completed in 2 hours.  Mine doesn't complete 
> in 48 hours using what is in Mahout.  That's a sparse matrix of size 500K x 
> 150K.  That certainly isn't that big.   I don't think RowSimJob, as is, 
> scales.  Sure, it's general purpose, but what good does that do when it 
> doesn't complete?

Maybe Sebastian can comment. Sounds like you're doing it all right,
but that's quite slow. As he's said, it's used in production at
reasonably big places like Foursquare. It does "something" right --
but clearly it's falling down in this case for some reason.


> Agreed.  Supporting non-algebraic similarity measures is the thing 
> that forces the inner-loop writes of co-occurrences and forces a whole lot 
> more shuffling and sorting before the reduce phase.

(Nope, we're still talking about two different things here. The
"counting" is what I'm referring to above, and it's the slow bit. The
similarity computation afterwards is pretty trivial in comparison.
Trust me.)
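
For a sense of scale (hypothetical illustration, not Mahout code): once
the counting phase has produced a pair's dot product and the two row
norms, a measure like cosine is O(1) arithmetic per pair. The expensive
part is producing those inputs.

public final class CosineSketch {
  // similarity(a, b) = (a . b) / (||a|| * ||b||)
  static double cosine(double dotProduct, double normA, double normB) {
    return dotProduct / (normA * normB);
  }
}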


> Most of it is similar.  Outputting the co-occurrences and doing the 
> sum on the reduce side is the big difference from what I've seen.

See above -- not a difference, no. I think there's still confusion
about which mapper is doing this work; it's figure 5.


> What do you think I've been doing?  I just don't think having a generic 
> version is all that useful for anything but smallish matrices, in which case 
> you probably don't need Mahout to begin with.
>
> I think we should pare down the reach of RowSimilarityJob to something that 
> actually is tractable.

I agree that there are opportunities to speed this up if you add more
hooks and flags -- though I don't think I agree with your reasoning so
far about where the bottlenecks are! Clearly this stuff works fine for
some people and not for your data, but that doesn't lead to the
conclusion that this job is useless.

It sure cries out for more information. For example, which phase is
the slow one? How sparse are the rows? Those answers could confirm or
deny theories, and either point toward solutions or confirm that the
job just needs some changes to perform better.
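
Hadoop counters are a cheap way to get both answers from one run. A
sketch, with invented counter names, assuming rows arrive as Mahout
VectorWritables the way RowSimilarityJob's input does:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

// Pass-through mapper that only gathers diagnostics. After the run,
// average row density = NON_ZERO_ENTRIES / ROWS_SEEN, and the
// JobTracker UI's per-phase timings say which stage is the slow one.
public class RowDensityMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  enum Diag { ROWS_SEEN, NON_ZERO_ENTRIES }

  @Override
  protected void map(IntWritable rowId, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    context.getCounter(Diag.ROWS_SEEN).increment(1);
    context.getCounter(Diag.NON_ZERO_ENTRIES)
           .increment(value.get().getNumNondefaultElements());
    context.write(rowId, value);  // rows pass through unchanged
  }
}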
