On Mon, Jul 18, 2011 at 3:09 PM, Sean Owen <[email protected]> wrote:

> Right! but how do you do that if you only saved co-occurrence counts?
>

Yeah, you can't do it with only that.  You can do it if you keep the pairs
along with the "current overlap sum", (A, B, <overlap>); then the combiner
and reducer both just sum overlaps.
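
To make the (A, B, <overlap>) idea concrete, here is a minimal sketch in plain Python rather than an actual Hadoop job. It assumes each co-occurrence contributes a partial overlap of 1; the function names and the toy data are made up for illustration and are not taken from the existing job. The point is only that the same summing function can serve as both combiner and reducer.

```python
from collections import defaultdict
from itertools import combinations

def map_user(items):
    """Mapper (hypothetical): emit ((A, B), overlap) for each item pair of one user."""
    for a, b in combinations(sorted(set(items)), 2):
        yield (a, b), 1  # assumed partial overlap of 1 per co-occurrence

def sum_overlaps(pairs):
    """Combiner and reducer are the same function: sum overlaps per (A, B) key."""
    totals = defaultdict(int)
    for key, overlap in pairs:
        totals[key] += overlap
    return dict(totals)

# Toy input, invented for this example.
users = {"u1": ["A", "B", "C"], "u2": ["A", "B"], "u3": ["B", "C"]}

# Combine locally per user, then reduce over the partial sums.
partials = [sum_overlaps(map_user(items)) for items in users.values()]
overlaps = sum_overlaps(kv for partial in partials for kv in partial.items())
```

Because addition is commutative and associative, running the combiner zero, one, or many times before the reducer gives the same totals.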


> You can surely pull a very similarly-shaped trick to calculate the
> cosine measure; that's exactly what this paper is doing in fact. But
> it's a different computation.
>
> Right now the job saves *all* the info it might need to calculate any
> of these things later. And that's heavy.
>

Yeah, I guess I see that.  Which similarity measures require all this
extra baggage?


>
> On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]>
> wrote:
> > On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
> >
> >> How do you implement, for instance, the cosine similarity with this
> >> output?
> >> That's the intent behind preserving this info, which is surely a lot
> >> to preserve.
> >>
> >
> > Sorry to jump in the middle of this, but cosine is not too hard to do with
> > nice combiners: first normalize the rows, then do my ubiquitous "outer
> > product of columns" trick on the resultant corpus (this latter job uses
> > combiners easily because the mappers do all the multiplications and the
> > reducers are simply sums, which are commutative and associative).
> >
> > Not sure about the other fancy similarities.
>
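
For reference, the normalize-then-outer-product approach quoted above can be sketched roughly as follows. This is plain Python, not the real job, and it reflects one reading of the trick: after scaling each row to unit length, every column emits the products of its pairs of entries, and the sums of those products per row pair are the cosines. All names and the toy matrix are assumptions made for illustration.

```python
import math
from collections import defaultdict

def normalize(row):
    """Scale a sparse row {column: value} to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row.values()))
    return {c: v / norm for c, v in row.items()} if norm else {}

def map_column(column):
    """Mapper (hypothetical): column is {row_id: value}; emit pairwise products."""
    ids = sorted(column)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            yield (a, b), column[a] * column[b]

def sum_products(pairs):
    """Combiner and reducer: sum the partial products per row pair."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Toy sparse matrix, invented for this example.
rows = {"r1": {"x": 1.0, "y": 1.0}, "r2": {"x": 1.0}, "r3": {"y": 2.0}}
normalized = {r: normalize(vec) for r, vec in rows.items()}

# Transpose to columns so each mapper sees one column.
columns = defaultdict(dict)
for r, vec in normalized.items():
    for c, v in vec.items():
        columns[c][r] = v

cosine = sum_products(kv for col in columns.values() for kv in map_column(col))
```

Row pairs that share no column never get emitted, so their cosine is implicitly zero.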
