Right, but how do you do that if you only saved co-occurrence counts? You can surely pull a similarly shaped trick to calculate the cosine measure; in fact, that's exactly what this paper is doing. But it's a different computation.
Right now the job saves *all* the info it might need to calculate any of these things later. And that's heavy.

On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
> On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
>
>> How do you implement, for instance, the cosine similarity with this output?
>> That's the intent behind preserving this info, which is surely a lot
>> to preserve.
>>
>
> Sorry to jump in the middle of this, but cosine is not too hard to use nice
> combiners, as it can be done by first normalizing the rows and then
> doing my ubiquitous "outer product of columns" trick on the resultant
> corpus (this latter job uses combiners easily because the mappers do all
> multiplications, and all reducers are simply sums, and thus are commutative
> and associative).
>
> Not sure about the other fancy similarities.
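
For concreteness, here is a rough in-memory sketch of the approach quoted above: normalize each row to unit length, then sum the outer product of every column with itself. It is plain Python standing in for the map and reduce phases, not Mahout code, and all function names (normalize_rows, transpose, map_column, cosine_similarities) are made up for illustration. It assumes sparse rows stored as dicts and that the similarity wanted is between rows.

```python
import math
from collections import defaultdict


def normalize_rows(rows):
    """Scale each sparse row (dict: column -> value) to unit L2 norm."""
    out = {}
    for row_id, vec in rows.items():
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # An all-zero row has no direction; keep it empty.
        out[row_id] = {c: v / norm for c, v in vec.items()} if norm > 0 else {}
    return out


def transpose(rows):
    """Regroup entries by column: column -> {row_id: value}."""
    cols = defaultdict(dict)
    for row_id, vec in rows.items():
        for c, v in vec.items():
            cols[c][row_id] = v
    return cols


def map_column(col):
    """'Mapper': emit every product in the outer product of one column with itself."""
    for i, vi in col.items():
        for j, vj in col.items():
            yield (i, j), vi * vj


def cosine_similarities(rows):
    """'Reducer': sum the partial products per (row, row) key.

    Because the reduce step is a plain sum (commutative and associative),
    the same function could also serve as a combiner.
    """
    sums = defaultdict(float)
    for col in transpose(normalize_rows(rows)).values():
        for key, partial in map_column(col):
            sums[key] += partial
    return dict(sums)


# Hypothetical toy input: three rows over two columns.
rows = {"a": {0: 1.0, 1: 1.0}, "b": {0: 1.0}, "c": {1: 2.0}}
sims = cosine_similarities(rows)
```

Pairs of rows that share no column never get a key, so the output stays sparse. Note that this only covers cosine; it says nothing about whether the other similarity measures decompose the same way.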
