Here's one possibly interesting idea, as a middle ground between emitting
just one metric-specific datum and keeping absolutely all the input data.
What if you emitted all the relevant functions of the cooccurrence that
you might need? If you emitted the count 1, the product of the two items'
ratings, and their difference, for example, you'd be able to compute
almost all the metrics I see here. And that's small, and combinable.

And then of course, it's not much harder for the job to select what it
needs based on the metric.
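
To make that concrete, here is a rough sketch of what I mean, in plain
Python rather than as an actual job. All the names here (PairStats, emit,
combine, select) are made up for illustration and are not existing
classes, and the three metrics shown are just the ones that fall directly
out of these three fields. A metric that needs something else, squared
ratings for example, would need an extra field added in the same way.

```python
from functools import reduce
from typing import NamedTuple


class PairStats(NamedTuple):
    """Hypothetical per-pair record: small, and summable field by field."""
    count: int
    sum_product: float
    sum_diff: float


def emit(rating_a: float, rating_b: float) -> PairStats:
    # One cooccurrence of an item pair: the count 1, the product of the
    # two ratings, and their difference.
    return PairStats(1, rating_a * rating_b, rating_a - rating_b)


def combine(left: PairStats, right: PairStats) -> PairStats:
    # Element-wise sum, so it is associative and commutative and could
    # run in a combiner as well as in the reducer.
    return PairStats(
        left.count + right.count,
        left.sum_product + right.sum_product,
        left.sum_diff + right.sum_diff,
    )


def select(stats: PairStats, metric: str) -> float:
    # The job picks what it needs from the combined record.
    if metric == "cooccurrence":
        return stats.count
    if metric == "dot_product":
        return stats.sum_product
    if metric == "mean_difference":
        return stats.sum_diff / stats.count
    raise ValueError(f"metric not covered by this sketch: {metric}")


# Ratings that three users gave to the same pair of items (made-up data).
ratings = [(4.0, 5.0), (3.0, 1.0), (2.0, 2.0)]
total = reduce(combine, (emit(a, b) for a, b in ratings))
```

The point is that the combined record stays the same size no matter how
many cooccurrences are folded into it, unlike a list of the original
preference values.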

Maybe there's a devil lurking in the details there but I bet that'd be
a decent way forward.

On Mon, Jul 18, 2011 at 11:40 PM, Ted Dunning <[email protected]> wrote:
> Yes.  I am advocating that we emit the counts.
>
> On Mon, Jul 18, 2011 at 3:05 PM, Sean Owen <[email protected]> wrote:
>
>> The difference is that the job is not outputting those "1s" now. It's
>> actually outputting the list of original preference values for any
>> occurrences of that pair! You can't combine those away. That's why the
>> output is so big!
>>
>> If that's what you intended, then I completely agree.
>>
>
