On Mon, May 14, 2012 at 10:40 AM, Barry, Sean F <sean.f.ba...@intel.com> wrote: > I am having a bit of trouble understanding how the Terasort benchmark works, > especially the fundamentals of how the data is sorted. If the data is being > split into many chunks wouldn't it all have to be re-integrated back into the > entire dataset?
Before the job is launched, the input is sampled to find "cut" points. Those cut points are used to assign keys to reduces. For example, if you have 100 reduces, there are 99 keys chosen. All keys less than the first are sent to the first reduce, between the first two keys are sent to the second reduce and so on. The logic is done by the TotalOrderPartitioner, which replaces MapReduce's default HashPartitioner. -- Owen