On Mon, Jul 18, 2011 at 10:55 PM, Grant Ingersoll <[email protected]> wrote:
> Yes, sorry.  Figure 5.  I don't know what the Combiner has to do with this.  
> That's an added bonus.  I'm talking about the emit step. (line 7).  We are 
> doing the emit step inside the inner loop for all and outputting every little 
> co-occurrence in the name of supporting a whole host of similarity measures, 
> they are doing it outside and theoretically only support algebraic functions 
> (and have implemented one).  I _UNDERSTAND_ why we are doing this, I just 
> don't think it is worthwhile to be so generic.

Correct, a Combiner is gravy, and the placement of the emit inside the loop is not the real difference here.

Either way, you're emitting the same data. You can emit tuples:
1. (A,B)->1 (A,D)->1 (A,E)->1 (B,C)->1 ...
or pack them into an array/vector/list
2. (A,[B->1,D->1,E->1]) (B,[C->1,...]) ...

The job does #1; I actually prefer #2, from the paper. Either way you
can use a combiner, and either way you can support what you're calling
an 'algebraic' metric. This is not the difference.
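To make the two layouts concrete, here is a minimal sketch (illustrative Python, not Mahout code; the function names and the item list are hypothetical). It emits the co-occurrences for one user's item list in both forms, and shows that a combiner can pre-sum the counts in either case, precisely because plain addition is algebraic:

```python
from collections import defaultdict

def emit_pairs(items):
    """Layout #1: one (pair -> 1) tuple per co-occurrence."""
    out = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            out.append(((a, b), 1))
    return out

def emit_stripes(items):
    """Layout #2: pack each item's co-occurrences into one map."""
    out = []
    for i, a in enumerate(items):
        stripe = defaultdict(int)
        for b in items[i + 1:]:
            stripe[b] += 1
        if stripe:
            out.append((a, dict(stripe)))
    return out

def combine_counts(emitted):
    """A combiner over layout #1: collapse duplicate keys by summing.

    (For layout #2 the combiner would merge the per-item maps by
    summing matching entries; same idea, addition either way.)
    """
    summed = defaultdict(int)
    for key, count in emitted:
        summed[key] += count
    return dict(summed)
```

For `["A", "B", "D"]`, layout #1 emits `(A,B)->1 (A,D)->1 (B,D)->1`, while layout #2 emits `(A,[B->1,D->1]) (B,[D->1])`; two copies of the same pair collapse to a single `->2` entry under the combiner.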

The difference is that the job is not outputting those "1s" now. It's
actually outputting the list of original preference values for any
occurrences of that pair! You can't combine those away. That's why the
output is so big!
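A tiny sketch of the contrast (illustrative Python; the preference values and function names are made up). Counts shrink under a combiner, but raw preference-value lists can only be concatenated, so every value survives to the reducer:

```python
def combiner_for_counts(values):
    """Counts are algebraic: n values collapse to 1 partial sum."""
    return [sum(values)]

def combiner_for_pref_lists(values):
    """Raw preference lists must all reach the reducer intact, so the
    best a combiner can do is concatenate them: no size reduction."""
    merged = []
    for v in values:
        merged.extend(v)
    return [merged]
```

Three emissions of `1` become the single value `3`, but three emissions of preference lists become one list still holding all three preferences, which is why the job's output grows the way it does.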

If that's what you intended, then I completely agree.


I had assumed it 'got away' with this by pruning really aggressively,
but it doesn't actually prune. If I'm right about that, it's easy to
fix. But I don't know yet whether that's the issue.
