You *can* definitely emit all the data in one large record rather than many small records! You'd probably have to recode a lot of other jobs, but in theory it's not so hard, unless I'm missing something. Doing so isn't a problem, and the current approach isn't a design decision forced by anything else.
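To make the two emission styles concrete, here is a rough sketch in plain Java, with no Hadoop involved. The names `EmitSketch`, `emitPerPair`, `emitPerRow`, and `countPairs` are made up for illustration and are not anything in Mahout. It assumes "one large record" means bundling all pairs for a given outer-loop row into a single value, which is only one possible reading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Mahout or Hadoop.
class EmitSketch {

    // Inner-loop style: one small record per (rowA, rowB) pair.
    // Each record is {rowA, rowB, valueA, valueB}.
    static List<double[]> emitPerPair(int[] rows, double[] values) {
        List<double[]> records = new ArrayList<>();
        for (int n = 0; n < rows.length; n++) {
            for (int m = n; m < rows.length; m++) {
                records.add(new double[] {rows[n], rows[m], values[n], values[m]});
            }
        }
        return records;
    }

    // Outer-loop style: one large record per rowA, holding every pair for that row.
    static List<List<double[]>> emitPerRow(int[] rows, double[] values) {
        List<List<double[]>> records = new ArrayList<>();
        for (int n = 0; n < rows.length; n++) {
            List<double[]> pairs = new ArrayList<>();
            for (int m = n; m < rows.length; m++) {
                pairs.add(new double[] {rows[n], rows[m], values[n], values[m]});
            }
            records.add(pairs); // emitted once per outer iteration
        }
        return records;
    }

    // Total number of pair entries carried inside the large records.
    static int countPairs(List<List<double[]>> records) {
        int total = 0;
        for (List<double[]> record : records) {
            total += record.size();
        }
        return total;
    }

    public static void main(String[] args) {
        int[] rows = {1, 2, 3, 4};
        double[] values = {0.5, 1.0, 2.0, 4.0};
        List<double[]> small = emitPerPair(rows, values);
        List<List<double[]>> large = emitPerRow(rows, values);
        System.out.println("small records: " + small.size());
        System.out.println("large records: " + large.size()
                + ", pairs inside: " + countPairs(large));
    }
}
```

With four rows this produces 10 small records versus 4 large ones, but both carry the same 10 pair entries; only the record count changes, not the number of pairs carried.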
But I'm suggesting that, even if you do that, it won't be much different. It's the same amount of data, not being combined, and that's probably going to dominate any such detail.

On Mon, Jul 18, 2011 at 11:01 PM, Grant Ingersoll <[email protected]> wrote:
> I believe what the paper is advocating is that one outputs the partial
> weights of the co-occurrences, already pre computed. Again, it's the
> difference between emitting in the inner loop and the outer loop of the code
> below. I gotta believe that is an order of magnitude reduction in the amount
> of stuff that has to be sorted and shuffled and then reduced. But, it does
> preclude us from supporting some similarity measures, I suppose.
>
> <code>
> for (int n = 0; n < weightedOccurrences.length; n++) {
>   int rowA = weightedOccurrences[n].getRow();
>   double weightA = weightedOccurrences[n].getWeight();
>   double valueA = weightedOccurrences[n].getValue();
>   for (int m = n; m < weightedOccurrences.length; m++) {
>     int rowB = weightedOccurrences[m].getRow();
>     double weightB = weightedOccurrences[m].getWeight();
>     double valueB = weightedOccurrences[m].getValue();
>     if (rowA <= rowB) {
>       rowPair.set(rowA, rowB, weightA, weightB);
>       coocurrence.set(column.get(), valueA, valueB);
>     } else {
>       rowPair.set(rowB, rowA, weightB, weightA);
>       coocurrence.set(column.get(), valueB, valueA);
>     }
>     ctx.write(rowPair, coocurrence); // INNER LOOP
>     numPairs++;
>   }
>   //VERSUS EMITTING HERE
> }
> </code>
>
> -Grant
>
> On Jul 18, 2011, at 5:47 PM, Sean Owen wrote:
>
>> Completely agree; I had thought the suggestion was that the paper
>> shows combining within one map invocation. I don't believe that's
>> possible here since one map will output at most one value for any key.
>>
>> On Mon, Jul 18, 2011 at 10:42 PM, Ted Dunning <[email protected]> wrote:
>>> The combiner works across multiple invocations of the map function and may
>>> be applied on the reduce side as well.
>
> --------------------------
> Grant Ingersoll
