If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is there a way to add it to the classpath without the following patch?
https://issues.apache.org/jira/browse/HADOOP-3570
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%[email protected]%3e

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch <[email protected]> wrote:

> Paco,
>
> Thanks, good ideas on the combiner. I'm going to tweak things a bit as you
> suggest and report back later...
>
> -Pete
>
> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <[email protected]> wrote:
>
>> Hi Peter,
>>
>> Thinking aloud on this, the trade-offs may depend on:
>>
>> * how much grouping would be possible (tracking a PDF would be
>>   interesting for metrics)
>> * locality of key/value pairs (distributed among mapper and reducer
>>   tasks)
>>
>> To that point, will there be much time spent in the shuffle? If so,
>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>> many small key/value pairs.
>>
>> In any case, when I had a similar situation on a large data set (2-3 TB
>> shuffle), a good pattern to follow was:
>>
>> * the mapper emitted small key/value pairs
>> * the combiner grouped them into row vectors
>>
>> That combiner may get invoked both at the end of the map phase and at
>> the beginning of the reduce phase (more benefit).
>>
>> Also, using byte arrays where possible to represent values can save
>> much shuffle time.
>>
>> Best,
>> Paco
>>
>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch <[email protected]> wrote:
>>
>> > Hadoop streaming question: if I am forming a matrix M by summing a
>> > number of elements generated on different mappers, is it better to
>> > emit tons of lines from the mappers with small key/value pairs for
>> > each element, or should I group them into row vectors before sending
>> > to the reducers?
>> >
>> > For example, say I'm summing frequency count matrices M for each user
>> > on a different map task, and the reducer combines the resulting sparse
>> > user count matrices for use in another calculation.
>> >
>> > Should I emit the individual elements:
>> >
>> > i (j, Mij) \n
>> > 3 (1, 3.4) \n
>> > 3 (2, 3.4) \n
>> > 3 (3, 3.4) \n
>> > 4 (1, 2.3) \n
>> > 4 (2, 5.2) \n
>> >
>> > or posting-list-style vectors?
>> >
>> > 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>> > 4 ((1, 2.3), (2, 5.2)) \n
>> >
>> > Using vectors will at least save some message space, but are there
>> > any other benefits to this approach in terms of Hadoop streaming
>> > overhead (sorts, etc.)? I think buffering issues will not be a huge
>> > concern, since the lengths of the vectors have a reasonable upper
>> > bound and the matrices will be in a sparse format...
>> >
>> > --
>> > Peter N. Skomoroch
>> > 617.285.8348
>> > http://www.datawrangling.com
>> > http://delicious.com/pskomoroch
>> > http://twitter.com/peteskomoroch

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
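Paco's suggested pattern (mappers emit small per-element pairs, a combiner groups them into the posting-list-style row vectors from the question) can be sketched as a Hadoop streaming combiner script in Python. This is a minimal illustration, not code from the thread: the tab-separated line format ("i<TAB>j,value") and the function name are assumptions, and streaming's guarantee that combiner/reducer input arrives sorted by key is what lets a single pass with groupby work.

```python
import sys
from itertools import groupby

def combine(lines):
    """Group sorted element lines like "3\t1,3.4" into one
    posting-list line per row, e.g. "3\t(1,3.4),(2,3.4)".
    Assumes input is already sorted by the row key, as Hadoop
    streaming guarantees for combiner/reducer input."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for row_key, group in groupby(parsed, key=lambda kv: kv[0]):
        # Collect every (j, value) element for this row into one line.
        postings = ",".join("(%s)" % element for _, element in group)
        yield "%s\t%s" % (row_key, postings)

if __name__ == "__main__":
    for out_line in combine(sys.stdin):
        print(out_line)
```

For the summing use case in the question, a real combiner would also merge duplicate (i, j) entries by adding their values before emitting the row, which shrinks the shuffle further.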
