The way partitioning and sorting currently happens on the map side is that <key, value> output records are transformed into <partition, key, value> and this is sorted on partition and key.
I assume it would be more efficient to just group the records into partitions and then sort each of these individually, so why do we do a global sort? My only guess is that this simplifies things when there are buffer spills to disk and not all records are in memory. That being said, I've not yet run into a case where buffer spills couldn't be avoided by properly configuring buffer parameters. I've not run any benchmarks on this, but I'm just curious what others think, whether the difference would be negligible, etc. Ed
