Thanks Harsh! And is this the right source code for the shuffling that is done in the reduce task?
http://search-hadoop.com/c/Hadoop:/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java%7C%7Cshuffle+sort

-sb

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Tuesday, June 05, 2012 7:43 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle/sort

Hey Sean,

Check out http://www.slideshare.net/jhammerb/hadoop-map-reduce-arch-106883, a slightly dated and MR1-oriented presentation from Owen O'Malley that goes a good level in-depth to get an overview of how things work (including how reduces pull data). After that, check out Chris Douglas' http://www.slideshare.net/hadoopusergroup/ordered-record-collection that goes in-depth into the evolution of the implementations of that layer.

This is pretty much the state of 0.20/1.0 today too, and in 2.0 we have had Netty replacing Jetty among other improvements but I haven't a public document link to share on this yet. Others may share the changes docs on 2.0 if they have a link to one (or I'll respond back as soon as I have one).

I hope this helps!

On Wed, Jun 6, 2012 at 4:16 AM, Barry, Sean F <sean.f.ba...@intel.com> wrote:
> "I was always wondering after mapping, how each reduce task get its
> input. It is said in google's paper and hadoop's documentation that a
> sort is done to aggregate the same key of the map output. But there is
> no detailed explanation of how it is implemented and my intuition is
> that perhaps a global hashing will work better than sorting. So I
> really want to know the details and see whether my intuition is right. If I
> can find out that in the source code, where should I start with?"
>
> I saw this question online and no one replied to it. does anyone know where I
> go to study the source code for the shuffle and sort.
>
> -sean

--
Harsh J
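
For anyone reading this thread later, the sort-versus-hash question in the quoted message may be easier to reason about with a toy model. The sketch below is not Hadoop code and does not mirror Shuffle.java. It only illustrates the general idea that hashing picks which reduce task receives a key, while sorting and merging group equal keys inside that reduce task. Every name in it (partition, map_side, reduce_side, shuffle) is hypothetical, and zlib.crc32 is just an assumed stand-in for whatever hash the partitioner uses.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Hypothetical partitioner: a stable hash so that the same key
    # always maps to the same reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(records, num_reducers):
    # Split one map task's output into one run per reduce task,
    # then sort each run by key.
    runs = [[] for _ in range(num_reducers)]
    for key, value in records:
        runs[partition(key, num_reducers)].append((key, value))
    return [sorted(run, key=itemgetter(0)) for run in runs]


def reduce_side(sorted_runs):
    # Merge the already-sorted runs pulled from every map task,
    # then group adjacent records that share a key.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


def shuffle(map_inputs, num_reducers):
    # map_inputs holds one list of (key, value) records per map task.
    map_outputs = [map_side(records, num_reducers) for records in map_inputs]
    # Reduce task r collects run r from every map task's output.
    return [
        reduce_side([output[r] for output in map_outputs])
        for r in range(num_reducers)
    ]
```

Calling shuffle([[("b", 1), ("a", 2)], [("a", 3), ("c", 4), ("b", 5)]], 2) returns one list per reduce task, each holding (key, values) pairs in key order. Because the map-side runs are already sorted, the reduce side only needs a streaming merge to bring equal keys together, which is one reason a sort-based design can be attractive when the data does not fit in memory.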