My goal is to see how the performance and network utilization of TeraSort is affected by varying the replication factor from 1-3 on my 16-node cluster. (I have modified TeraSort such that it uses my system's replication factor.) I am sorting 100GB.
In particular, I am confused by the network utilization. With 1 replica, the network utilization is under 1GB. With 2 replicas, it is about 117GB. And with 3 replicas, it is about 225-230GB. I understand that just replicating the 100GB of sorted data causes 100GB and 200GB of network traffic in the 2 and 3 replica configurations, respectively. However, what accounts for the extra 17GB and 25-30GB in the 2 and 3 replica configs? And what accounts for the minimal network usage in the 1 replica configuration? Note that the data is generated with TeraGen using the same replication factor with which it is later sorted. Thank you, Eitan Rosenfeld