On Mon, Jul 18, 2011 at 3:09 PM, Sean Owen <[email protected]> wrote:

> Right! But how do you do that if you only saved co-occurrence counts?
>
Yeah, you can't do it with only that. You can do it if you keep the pairs
with the "current overlap sum": (A, B, <overlap>), then the combiner and
reducer are just summing overlaps (rough sketch at the bottom of this mail).

> You can surely pull a very similarly-shaped trick to calculate the
> cosine measure; that's exactly what this paper is doing in fact. But
> it's a different computation.
>
> Right now the job saves *all* the info it might need to calculate any
> of these things later. And that's heavy.
>

Yeah, I guess I see that. Which similarity measures require all this
extra baggage?

> On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]>
> wrote:
>
> > On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
> >
> >> How do you implement, for instance, the cosine similarity with this
> >> output? That's the intent behind preserving this info, which is surely
> >> a lot to preserve.
> >
> > Sorry to jump in the middle of this, but cosine is not too hard to do
> > with nice combiners, as it can be done by first normalizing the rows and
> > then doing my ubiquitous "outer product of columns" trick on the
> > resultant corpus (this latter job uses combiners easily because the
> > mappers do all the multiplications, and all reducers are simply sums,
> > and thus are commutative and associative).
> >
> > Not sure about the other fancy similarities.
>
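For concreteness, here's roughly what I mean by "the combiner and reducer
are just summing overlaps". This is only a minimal sketch, not the actual
Mahout job; the class and key names are hypothetical, and it assumes the
mapper has already emitted ((itemA, itemB), partialOverlap) pairs:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the partial overlap counts for one item pair. Because summation is
// commutative and associative, the same class can be registered both as
// the combiner and as the reducer.
public class OverlapSumReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text pair, Iterable<LongWritable> partialOverlaps,
                        Context ctx) throws IOException, InterruptedException {
    long sum = 0L;
    for (LongWritable partial : partialOverlaps) {
      sum += partial.get();
    }
    ctx.write(pair, new LongWritable(sum));
  }
}

In the driver you'd point both slots at the same class, i.e.
job.setCombinerClass(OverlapSumReducer.class) and
job.setReducerClass(OverlapSumReducer.class).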

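And to illustrate the "normalize the rows, then outer-product the columns"
trick Jake describes above, here's a toy single-machine version (plain Java,
no Hadoop), just to show why the reduce side collapses to a plain sum. The
matrix layout (rows = items, columns = users) and all the names here are my
own for the example, not anything from the actual job:

import java.util.HashMap;
import java.util.Map;

public class CosineViaColumnOuterProducts {

  public static void main(String[] args) {
    // rows = items, columns = users
    double[][] a = {
        {1, 0, 2, 3},
        {0, 1, 2, 0},
        {4, 0, 0, 1},
    };
    int rows = a.length;
    int cols = a[0].length;

    // Step 1: normalize every row to unit L2 norm, so that row dot
    // products are exactly cosine similarities.
    for (int i = 0; i < rows; i++) {
      double norm = 0.0;
      for (int u = 0; u < cols; u++) {
        norm += a[i][u] * a[i][u];
      }
      norm = Math.sqrt(norm);
      for (int u = 0; u < cols; u++) {
        a[i][u] /= norm;
      }
    }

    // Step 2: "map" over columns, emitting a partial product for every
    // pair of rows that co-occur in that column; the "reduce" (and hence
    // any combiner) is a plain sum of those partials.
    Map<String, Double> sums = new HashMap<String, Double>();
    for (int u = 0; u < cols; u++) {
      for (int i = 0; i < rows; i++) {
        for (int j = i + 1; j < rows; j++) {
          if (a[i][u] != 0.0 && a[j][u] != 0.0) {
            String key = i + "," + j;
            Double prev = sums.get(key);
            sums.put(key, (prev == null ? 0.0 : prev) + a[i][u] * a[j][u]);
          }
        }
      }
    }

    // The accumulated sums are the pairwise cosine similarities of the rows.
    for (Map.Entry<String, Double> e : sums.entrySet()) {
      System.out.println("cosine(" + e.getKey() + ") = " + e.getValue());
    }
  }
}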