I am having a bit of trouble understanding how the TeraSort benchmark works, especially the fundamentals of how the data is sorted. If the data is split into many chunks, wouldn't it all have to be merged back into a single dataset at the end?
And since a terabyte is huge, wouldn't that take a very long time? I seem to be missing a few crucial steps in the process, and if someone could help me understand how TeraSort works, that would be great. Any papers or videos on this topic would be greatly appreciated. -SB