Paco,

Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
suggest and report back later...
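Something like this is roughly what I have in mind for the combiner — just a
sketch, untested against a real job, and the tab-delimited "i<TAB>j,value"
line format plus the group_rows helper name are my own assumptions:

```python
import sys
from itertools import groupby

def group_rows(lines):
    """Group sorted 'i<TAB>j,value' lines into one row vector per key i.

    Hadoop streaming hands the combiner its input already sorted by key,
    so a single pass with groupby is enough to build posting-list rows.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        # join the (j, value) entries for row i into one posting-list line
        entries = ",".join("(%s)" % value for _, value in group)
        yield "%s\t%s" % (key, entries)

if __name__ == "__main__":
    for row in group_rows(sys.stdin):
        print(row)
```

The mapper would keep emitting the small per-element pairs, and this script
would be wired in with streaming's -combiner option so it runs both after the
map phase and before the reduce, as you described.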

-Pete

On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <[email protected]> wrote:

> hi peter,
> thinking aloud on this -
>
> trade-offs may depend on:
>
>   * how much grouping would be possible (tracking a PDF would be
> interesting for metrics)
>   * locality of key/value pairs (distributed among mapper and reducer
> tasks)
>
> to that point, will there be much time spent in the shuffle?  if so,
> it's probably cheaper to shuffle/sort the grouped row vectors than the
> many small key,value pairs
>
> in any case, when i had a similar situation on a large data set (2-3
> TB shuffle) a good pattern to follow was:
>
>   * mapper emitted small key,value pairs
>   * combiner grouped into row vectors
>
> that combiner may get invoked both at the end of the map phase and at
> the beginning of the reduce phase (more benefit)
>
> also, representing values as byte arrays where possible can save a lot
> of shuffle time
>
> best,
> paco
>
>
> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
> <[email protected]> wrote:
> > Hadoop streaming question: If I am forming a matrix M by summing a number
> > of elements generated on different mappers, is it better to emit tons of
> > lines from the mappers with small key,value pairs for each element, or
> > should I group them into row vectors before sending to the reducers?
> >
> > For example, say I'm summing frequency count matrices M for each user on
> > a different map task, and the reducer combines the resulting sparse user
> > count matrices for use in another calculation.
> >
> > Should I emit the individual elements:
> >
> > i (j, Mij) \n
> > 3 (1, 3.4) \n
> > 3 (2, 3.4) \n
> > 3 (3, 3.4) \n
> > 4 (1, 2.3) \n
> > 4 (2, 5.2) \n
> >
> > Or posting list style vectors?
> >
> > 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
> > 4 ((1, 2.3), (2, 5.2)) \n
> >
> > Using vectors will at least save some message space, but are there any
> > other benefits to this approach in terms of Hadoop streaming overhead
> > (sorts etc.)?  I think buffering issues will not be a huge concern since
> > the length of the vectors has a reasonable upper bound and they will be
> > in a sparse format...
> >
> >
> > --
> > Peter N. Skomoroch
> > 617.285.8348
> > http://www.datawrangling.com
> > http://delicious.com/pskomoroch
> > http://twitter.com/peteskomoroch
> >
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch