Paco,

Thanks, good ideas on the combiner. I'm going to tweak things a bit as you
suggest and report back later...
-Pete

On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <[email protected]> wrote:
> hi peter,
> thinking aloud on this -
>
> trade-offs may depend on:
>
> * how much grouping would be possible (tracking a PDF would be
>   interesting for metrics)
> * locality of key/value pairs (distributed among mapper and reducer
>   tasks)
>
> to that point, will there be much time spent in the shuffle? if so,
> it's probably cheaper to shuffle/sort the grouped row vectors than the
> many small key,value pairs
>
> in any case, when i had a similar situation on a large data set (2-3
> Tb shuffle) a good pattern to follow was:
>
> * mapper emitted small key,value pairs
> * combiner grouped into row vectors
>
> that combiner may get invoked both at the end of the map phase and at
> the beginning of the reduce phase (more benefit)
>
> also, using byte arrays if possible to represent values may be able to
> save much shuffle time
>
> best,
> paco
>
>
> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
> <[email protected]> wrote:
> > Hadoop streaming question: If I am forming a matrix M by summing a
> > number of elements generated on different mappers, is it better to emit
> > tons of lines from the mappers with small key,value pairs for each
> > element, or should I group them into row vectors before sending to the
> > reducers?
> >
> > For example, say I'm summing frequency count matrices M for each user
> > on a different map task, and the reducer combines the resulting sparse
> > user count matrices for use in another calculation.
> >
> > Should I emit the individual elements:
> >
> > i (j, Mij) \n
> > 3 (1, 3.4) \n
> > 3 (2, 3.4) \n
> > 3 (3, 3.4) \n
> > 4 (1, 2.3) \n
> > 4 (2, 5.2) \n
> >
> > Or posting-list style vectors?
> >
> > 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
> > 4 ((1, 2.3), (2, 5.2)) \n
> >
> > Using vectors will at least save some message space, but are there any
> > other benefits to this approach in terms of Hadoop streaming overhead
> > (sorts etc.)? I think buffering issues will not be a huge concern,
> > since the length of the vectors has a reasonable upper bound and they
> > will be in a sparse format...
> >
> >
> > --
> > Peter N. Skomoroch
> > 617.285.8348
> > http://www.datawrangling.com
> > http://delicious.com/pskomoroch
> > http://twitter.com/peteskomoroch

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
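[Editor's note: the pattern paco describes (mapper emits small key,value pairs; a streaming combiner groups them into sparse row vectors before the shuffle) can be sketched roughly as below. This is a minimal illustration, not code from the thread: the tab-separated `row<TAB>col,value` line format and the function names are assumptions, and a real combiner might also sum duplicate (i,j) entries rather than only grouping them.]

```python
import sys
from itertools import groupby

def parse(line):
    """Split an element line 'i<TAB>j,value' into (row_key, (col, value))."""
    key, rest = line.rstrip("\n").split("\t", 1)
    j, v = rest.split(",")
    return key, (int(j), float(v))

def group_rows(lines):
    """Group consecutive element lines sharing a row key into one sparse
    row-vector line: 'i<TAB>(j1,v1) (j2,v2) ...'.

    Relies on the input already being sorted by key, which Hadoop
    guarantees when this runs as a combiner or reducer."""
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        vec = " ".join("(%d,%g)" % jv for _, jv in group)
        yield "%s\t%s" % (key, vec)

if __name__ == "__main__":
    # Hadoop streaming combiner: read element lines on stdin,
    # write grouped row-vector lines to stdout.
    for out in group_rows(sys.stdin):
        print(out)
```

Saved as e.g. `combiner.py` (a hypothetical filename), it would be wired in with streaming's `-combiner` option; because Hadoop may invoke the combiner zero, one, or several times, the output format here deliberately stays parseable by the reducer either way.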
