Yeah, this is exactly what I do! For "associative array", read
"sparse vector". You can still use a combiner, and it works better in
my experience. Although you're pushing the same amount of data through
the combiner, in practice having that many fewer records makes the
combine phase faster.
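
To make that concrete, here is a rough sketch of the stripes pattern in
plain Hadoop (new API), with MapWritable standing in for the associative
array; the class names are illustrative, not anything in Mahout:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesCooccurrenceSketch {

      // For each term in the line, emit one stripe mapping each
      // co-occurring neighbor to a partial count.
      public static class StripeMapper
          extends Mapper<Object, Text, Text, MapWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] terms = value.toString().split("\\s+");
          for (int i = 0; i < terms.length; i++) {
            MapWritable stripe = new MapWritable();
            for (int j = 0; j < terms.length; j++) {
              if (i == j) continue;
              Text neighbor = new Text(terms[j]);
              IntWritable count = (IntWritable) stripe.get(neighbor);
              if (count == null) {
                stripe.put(neighbor, new IntWritable(1));
              } else {
                count.set(count.get() + 1);
              }
            }
            context.write(new Text(terms[i]), stripe);
          }
        }
      }

      // Element-wise sum of stripes. Because the merge is associative
      // and commutative, the same class serves as both combiner and
      // reducer.
      public static class StripeSummer
          extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<MapWritable> stripes,
            Context context) throws IOException, InterruptedException {
          MapWritable sum = new MapWritable();
          for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
              IntWritable total = (IntWritable) sum.get(e.getKey());
              int add = ((IntWritable) e.getValue()).get();
              if (total == null) {
                // Copy the key; Writables from the framework may be reused.
                sum.put(new Text(e.getKey().toString()), new IntWritable(add));
              } else {
                total.set(total.get() + add);
              }
            }
          }
          context.write(key, sum);
        }
      }
    }

Swapping MapWritable for a sparse vector keyed by integer ids is what
"read 'sparse vector'" above amounts to; the shape of the job is the same.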

The issue here is that, as it stands, a combiner can't be used at all.
The job is not outputting co-occurrence counts, which could be summed;
it's outputting the co-occurring data itself, since some algorithms need
that. You can't combine those values.
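
A combiner is just a map-side reduce, so it's only legal when partially
merging values preserves the final answer. A toy contrast, with a
hypothetical class name:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Legal combiner: collapsing (term, [2, 3, 1]) to (term, 6) on the
    // map side leaves the reduce-side total unchanged, because addition
    // is associative and commutative.
    public class CountSummingCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

When the map output value is the co-occurring row itself (something the
downstream algorithm needs verbatim), no such merge exists: every record
has to reach the reducer intact, so there's nothing a combiner could
legally do.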

On Mon, Jul 18, 2011 at 9:35 PM, Dhruv Kumar <[email protected]> wrote:
> The "pairs" approach, where the mapper emits each co-occurring pair as a
> key-value pair, is I/O heavy but can run on clusters with little per-node
> memory.
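
For reference, a minimal sketch of that "pairs" mapper (the class name
and the concatenated-string key are just for illustration; a real job
would use a custom pair WritableComparable):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One output record per co-occurring pair: cheap on per-node memory,
    // heavy on shuffle volume.
    public class PairsMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      @Override
      protected void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          for (int j = 0; j < terms.length; j++) {
            if (i != j) {
              context.write(new Text(terms[i] + ',' + terms[j]), ONE);
            }
          }
        }
      }
    }
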
>
> There is another design pattern by Jimmy Lin called "stripes", where the
> co-occurrences per term are aggregated into associative arrays. The mappers
> emit the term as the key and the term's aggregated associative array as the
> value.
>
> The "stripes" approach is less network-I/O heavy but uses more memory per
> node because the hashmaps can get quite big.
>
> Can the RowSimilarityJob be redesigned to use the "stripes" design pattern
> as shown in the following paper by Jimmy Lin?
>
> http://www.aclweb.org/anthology-new/D/D08/D08-1044.pdf
>
>
>
> On Mon, Jul 18, 2011 at 3:58 PM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Jul 18, 2011, at 3:52 PM, Grant Ingersoll wrote:
>> >
>> > I think the big win is not to construe this implementation as being
>> > based on what is in the paper. I'm starting to think we should have
>> > two RowSimilarity jobs: one for algebraic similarity functions and one
>> > for those that are not.
>>
>>
>> And I don't mean two completely separate jobs, just a different
>> mapper/reducer for that phase (and likely a combiner, too).
>>
>> -Grant
>>
>>
>