hi peter,

thinking aloud on this - trade-offs may depend on:

  * how much grouping would be possible (tracking a PDF would be interesting for metrics)
  * locality of key/value pairs (distributed among mapper and reducer tasks)

to that point, will there be much time spent in the shuffle? if so, it's probably cheaper to shuffle/sort the grouped row vectors than the many small key,value pairs

in any case, when i had a similar situation on a large data set (2-3 TB shuffle) a good pattern to follow was:

  * mapper emitted small key,value pairs
  * combiner grouped into row vectors

that combiner may get invoked both at the end of the map phase and at the beginning of the reduce phase (more benefit)

also, using byte arrays if possible to represent values may be able to save much shuffle time

best,
paco

On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch <[email protected]> wrote:
> Hadoop streaming question: If I am forming a matrix M by summing a number of
> elements generated on different mappers, is it better to emit tons of lines
> from the mappers with small key,value pairs for each element, or should I
> group them into row vectors before sending to the reducers?
>
> For example, say I'm summing frequency count matrices M for each user on a
> different map task, and the reducer combines the resulting sparse user count
> matrices for use in another calculation.
>
> Should I emit the individual elements:
>
> i (j, Mij) \n
> 3 (1, 3.4) \n
> 3 (2, 3.4) \n
> 3 (3, 3.4) \n
> 4 (1, 2.3) \n
> 4 (2, 5.2) \n
>
> Or posting list style vectors?
>
> 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
> 4 ((1, 2.3), (2, 5.2)) \n
>
> Using vectors will at least save some message space, but are there any other
> benefits to this approach in terms of Hadoop streaming overhead (sorts
> etc.)? I think buffering issues will not be a huge concern since the length
> of the vectors have a reasonable upper bound and will be in a sparse
> format...
>
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>
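
A rough sketch of the "mapper emits small pairs, combiner groups them into row vectors" pattern from the reply, in Python since the question is about Hadoop streaming. The wire formats are assumptions made only for illustration: each input line is taken to be `i<TAB>j,value`, and each output line is `i<TAB>j:value j:value ...`. The function names `parse_element` and `combine_rows` are hypothetical, and the sketch assumes the rows seen by one invocation fit in memory. How the step is wired into an actual job is left out.

```python
from collections import defaultdict


def parse_element(line):
    """Parse one element line of the assumed form 'i<TAB>j,value'."""
    key, payload = line.rstrip("\n").split("\t", 1)
    col, value = payload.split(",", 1)
    return int(key), int(col), float(value)


def combine_rows(lines):
    """Group element lines by row index i, summing values that share (i, j).

    Yields one sparse row vector per row, in the assumed form
    'i<TAB>j:value j:value ...', with rows and columns in ascending order.
    """
    rows = defaultdict(lambda: defaultdict(float))
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        i, j, value = parse_element(line)
        rows[i][j] += value
    for i in sorted(rows):
        cells = " ".join(f"{j}:{rows[i][j]}" for j in sorted(rows[i]))
        yield f"{i}\t{cells}"
```

In a streaming script, `combine_rows` would typically be fed `sys.stdin` and its results written to standard output one per line. Because summation is associative, the same function can be applied again to partial sums once they are parsed back into elements, which is what makes it safe to run at more than one stage.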
