If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is there a way to add it to the classpath without the following patch?
https://issues.apache.org/jira/browse/HADOOP-3570
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%[email protected]%3e

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch <[email protected]> wrote:

> Paco,
>
> Thanks, good ideas on the combiner. I'm going to tweak things a bit as you
> suggest and report back later...
>
> -Pete
>
> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <[email protected]> wrote:
>
>> Hi Peter,
>>
>> Thinking aloud on this, the trade-offs may depend on:
>>
>> * how much grouping would be possible (tracking a PDF would be
>>   interesting for metrics)
>> * locality of key/value pairs (distributed among mapper and reducer
>>   tasks)
>>
>> To that point, will there be much time spent in the shuffle? If so,
>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>> many small key/value pairs.
>>
>> In any case, when I had a similar situation on a large data set (2-3 TB
>> shuffle), a good pattern to follow was:
>>
>> * the mapper emitted small key/value pairs
>> * the combiner grouped them into row vectors
>>
>> That combiner may get invoked both at the end of the map phase and at
>> the beginning of the reduce phase (more benefit).
>>
>> Also, using byte arrays where possible to represent values can save
>> much shuffle time.
>>
>> Best,
>> Paco
>>
>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch <[email protected]> wrote:
>>
>> > Hadoop streaming question: if I am forming a matrix M by summing a
>> > number of elements generated on different mappers, is it better to
>> > emit tons of lines from the mappers with small key/value pairs for
>> > each element, or should I group them into row vectors before sending
>> > to the reducers?
>> >
>> > For example, say I'm summing frequency count matrices M for each user
>> > on a different map task, and the reducer combines the resulting sparse
>> > user count matrices for use in another calculation.
>> >
>> > Should I emit the individual elements:
>> >
>> > i (j, Mij) \n
>> > 3 (1, 3.4) \n
>> > 3 (2, 3.4) \n
>> > 3 (3, 3.4) \n
>> > 4 (1, 2.3) \n
>> > 4 (2, 5.2) \n
>> >
>> > or posting-list-style vectors?
>> >
>> > 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>> > 4 ((1, 2.3), (2, 5.2)) \n
>> >
>> > Using vectors will at least save some message space, but are there
>> > any other benefits to this approach in terms of Hadoop streaming
>> > overhead (sorts, etc.)? I think buffering issues will not be a huge
>> > concern, since the lengths of the vectors have a reasonable upper
>> > bound and the matrices will be in a sparse format...
>> >
>> > --
>> > Peter N. Skomoroch
>> > 617.285.8348
>> > http://www.datawrangling.com
>> > http://delicious.com/pskomoroch
>> > http://twitter.com/peteskomoroch

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
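Paco's suggested pattern (mappers emit small per-element pairs, a combiner groups them into the posting-list-style row vectors from the question) can be sketched as a Hadoop streaming combiner script in Python. This is a minimal illustration, not code from the thread: the tab-separated line format ("i<TAB>j,value") and the function name are assumptions, and streaming's guarantee that combiner/reducer input arrives sorted by key is what lets a single pass with groupby work.

```python
import sys
from itertools import groupby

def combine(lines):
    """Group sorted element lines like "3\t1,3.4" into one
    posting-list line per row, e.g. "3\t(1,3.4),(2,3.4)".
    Assumes input is already sorted by the row key, as Hadoop
    streaming guarantees for combiner/reducer input."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for row_key, group in groupby(parsed, key=lambda kv: kv[0]):
        # Collect every (j, value) element for this row into one line.
        postings = ",".join("(%s)" % element for _, element in group)
        yield "%s\t%s" % (row_key, postings)

if __name__ == "__main__":
    for out_line in combine(sys.stdin):
        print(out_line)
```

For the summing use case in the question, a real combiner would also merge duplicate (i, j) entries by adding their values before emitting the row, which shrinks the shuffle further.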
