On Mon, Jul 18, 2011 at 8:52 PM, Grant Ingersoll <[email protected]> wrote:
> We definitely cannot do this all map side, but I think the big difference is 
> the paper emits the partial sum of all the co-occurrences it's seen, whereas 
> we emit all the little co-occurrences.  The big difference is between one 
> write of what amounts to a few identifying pieces and a double per outer loop 
> iteration and one write per inner loop iteration of Cooccurrence, as I 
> understand it.  I think that makes for a huge difference in performance.

If you mean the paper uses a combiner after the mapper, it may well
do. That's not really possible in RowSimilarityJob, since it emits
more than a plain count, and those values can't simply be 'combined'.
Yes, that has a price. If you knew you only wanted counts and
cooccurrences, this pipeline could be faster and slimmer.
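
For illustration, a count-only special case could get away with
something as simple as this summing combiner (a rough sketch against
the plain Hadoop API; the class name and the Text item-pair key are
made up here, this is not what RowSimilarityJob actually does):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical count-only combiner, NOT Mahout code. Summing works
// because a sum of partial sums is still a sum; RowSimilarityJob's
// per-pair contributions aren't that simple for every measure it
// supports, which is why it can't just drop this in.
public class CooccurrenceCountCombiner
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  private final LongWritable sum = new LongWritable();

  @Override
  protected void reduce(Text itemPair, Iterable<LongWritable> partialCounts,
                        Context context) throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable count : partialCounts) {
      total += count.get();        // pre-sum on the map node before the shuffle
    }
    sum.set(total);
    context.write(itemPair, sum);  // one record per pair leaves this node
  }
}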

I don't think the mapper can do any 'combining' on its own before
emitting, though, if that's what you're suggesting. Within one map
call you are touching one column; nothing in that map's own output
can be summed before writing, because every co-occurrence it emits
belongs to a different item-item pair.
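
To make that concrete, picture the pair-emitting mapper as something
like the sketch below (standard Hadoop types only, input format and
names invented for illustration). Every key it writes within one call
is a distinct pair, so pre-summing its own output buys nothing; a
combiner only pays off across the many map outputs collected on one
node.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: one map call sees one user's items as a comma-separated
// line and emits a '1' per item pair.
public class CooccurrencePairMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable offset, Text userLine, Context context)
      throws IOException, InterruptedException {
    String[] items = userLine.toString().split(",");
    Arrays.sort(items);   // canonical pair order: always (smaller, larger)
    for (int i = 0; i < items.length; i++) {
      for (int j = i + 1; j < items.length; j++) {
        context.write(new Text(items[i] + ':' + items[j]), ONE);  // each pair once
      }
    }
  }
}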


> Feature reduction is fine and all, but that isn't the problem here and it has 
> nothing to do with what I am talking about.  I'm just saying that if 
> RowSimilarityJob is saying it is implementing what is in the paper, then we 
> aren't doing it right.    As I said before, I'm doing this on a smaller 
> matrix than what is in the paper and on much better hardware and the 
> performance is 100x (my guess) worse.  It claims 2 hour running time, I've 
> yet to see this finish in 24-48 hours.

I'm not talking about feature reduction either, and neither is the
paper; we're talking about simple data pruning. You're asking "what's
different", and this is most definitely a difference: there is no
comparable pruning in RowSimilarityJob. Add to that the fact that
RowSimilarityJob does something more general than what this paper
describes; it isn't just computing cosine similarity, and that slows
it down too.
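
To be concrete about the kind of pruning I mean, it can be as crude as
a guard at the top of the pair-emitting mapper sketched above
(illustrative only; the cap is an arbitrary number, and whether to
drop or sample such users is up for debate):

// A user with n items produces n*(n-1)/2 pairs, so capping n cuts the
// shuffled data quadratically. 500 is an arbitrary example value.
private static final int MAX_ITEMS_PER_USER = 500;

@Override
protected void map(LongWritable offset, Text userLine, Context context)
    throws IOException, InterruptedException {
  String[] items = userLine.toString().split(",");
  if (items.length > MAX_ITEMS_PER_USER) {
    return;   // or sample down to the cap rather than dropping outright
  }
  // ... emit item-item pairs exactly as before ...
}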


> I think the big win is to not construe this implementation to be based on 
> what is in the paper.  I'm starting to think we should have two RowSimilarity 
> jobs.  One for algebraic functions and one for those who are not.

I really do think it is the same idea and the same shape of algorithm,
and it certainly is based on this paper. (It's a pretty simple idea
anyway; the full distributed recommender pipeline does the same thing
and predates this paper. It *does* have some features like limiting
row/column size; you might go dig there too if interested.) You may
agree more once you dig into the details.


There's a tradeoff between generality and performance, and an acute
one here. I think it's right for the one implementation here to work
correctly and be flexible at the expense of performance. For what it
does, it does it about as efficiently as I can imagine.

If you fix some more assumptions -- fix your metric, for example --
you can certainly make something faster. But it's not just two jobs
that emerge from building special-case pipelines; it'll be tens!
Still, there is probably a good way to weave in more hooks and flags
here so that the job can take better advantage of assumptions such as
"I only need cooccurrence counts and so can use a combiner".


To be specific: I have a different pipeline out here, based on the
Mahout pipeline, which is heavily optimized for a very specific metric
and type of input. It simply processes a lot less data, by
specializing for exactly what's needed and knowing what can be thrown
out. Its similarity phase seems to run faster than even what's in the
paper (though I'm mentally extrapolating from different results, so
that may not be accurate).

That is: the delta in code between this and a fast, specialized
pipeline is not big, though the resulting difference in running time
can be!


I can tell you from experience that it's the counting of the
cooccurrences that utterly dominates, unless you prune and use a
combiner, and even then it's not quick. The calculation of the
similarity itself is trivial in comparison; I don't think that part
will yield much speedup.

Dig in and start messing with the code; you'll probably come to agree
with some of this, and also identify a lot of improvements!
