Hey Sean,

Check out http://www.slideshare.net/jhammerb/hadoop-map-reduce-arch-106883, a slightly dated, MR1-oriented presentation from Owen O'Malley. It goes into enough depth to give a good overview of how things work (including how reduces pull data).
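
As a rough orientation before you go through the slides, here is a minimal sketch of the idea in plain Java with no Hadoop dependency. The partition formula mirrors Hadoop's default HashPartitioner; everything else (the ShuffleSketch class, the mapSide/reduceSide methods, the in-memory TreeMaps, and counting occurrences the way a combiner might) is made up for illustration and is not how the real code is structured. It only shows that hashing decides which reduce gets a key, while sorting keeps keys ordered inside each partition so a reduce can merge the pieces it pulls from every map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ShuffleSketch {
    // Same formula as Hadoop's default HashPartitioner: hash picks the reduce.
    static int partitionFor(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Hypothetical map side: one sorted bucket per reduce.
    // Counting occurrences here stands in for a combiner, to keep this short.
    static List<TreeMap<String, Integer>> mapSide(List<String> keys, int numReduces) {
        List<TreeMap<String, Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) {
            parts.add(new TreeMap<>());
        }
        for (String k : keys) {
            parts.get(partitionFor(k, numReduces)).merge(k, 1, Integer::sum);
        }
        return parts;
    }

    // Hypothetical reduce side: pull this reduce's partition from every map
    // output and merge them, keeping keys in sorted order.
    static TreeMap<String, Integer> reduceSide(
            List<List<TreeMap<String, Integer>>> mapOutputs, int partition) {
        TreeMap<String, Integer> merged = new TreeMap<>();
        for (List<TreeMap<String, Integer>> out : mapOutputs) {
            out.get(partition).forEach((k, v) -> merged.merge(k, v, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> m1 = mapSide(Arrays.asList("b", "a", "a"), 2);
        List<TreeMap<String, Integer>> m2 = mapSide(Arrays.asList("c", "a"), 2);
        List<List<TreeMap<String, Integer>>> all = Arrays.asList(m1, m2);
        for (int p = 0; p < 2; p++) {
            System.out.println("reduce " + p + ": " + reduceSide(all, p));
        }
    }
}
```

So both hashing and sorting are in play: hashing is global and cheap, and the per-partition sort is what lets each reduce stream through its keys in order instead of holding a hash table of all of them in memory.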
After that, check out Chris Douglas' http://www.slideshare.net/hadoopusergroup/ordered-record-collection, which goes in depth into how the implementation of that layer has evolved. This is pretty much the state of 0.20/1.0 today too. In 2.0, Netty has replaced Jetty, among other improvements, but I don't have a public document link to share on this yet. Others may share the 2.0 change docs if they have a link to one (or I'll respond back as soon as I have one).

I hope this helps!

On Wed, Jun 6, 2012 at 4:16 AM, Barry, Sean F <sean.f.ba...@intel.com> wrote:
> "I was always wondering after mapping, how each reduce task get its input. It
> is said in google's paper and hadoop's documentation that a sort is done to
> aggregate the same key of the map output. But there is no detailed
> explanation of how it is implemented and my intuition is that perhaps a
> global hashing will work better than sorting. So I really want to know the
> details and see whether my intuition is right. If I can find out that in the
> source code, where should I start with?"
>
> I saw this question online and no one replied to it. does anyone know where I
> go to study the source code for the shuffle and sort.
>
> -sean

--
Harsh J