My goal is to see how the performance and network utilization of TeraSort
is affected by varying the replication factor from 1-3 on my 16-node
cluster. (I have modified TeraSort such that it uses my system's
replication factor.) I am sorting 100GB.

In particular, I am confused by the network utilization. With 1 replica,
the network utilization is under 1GB. With 2 replicas, it is about 117GB.
And with 3 replicas, it is about 225-230GB.

I understand that just replicating the 100GB of sorted data causes 100GB
and 200GB of network traffic in the 2 and 3 replica configurations,
respectively. However, what accounts for the extra 17GB and 25-30GB in the
2 and 3 replica configs? And what accounts for the minimal network usage in
the 1 replica configuration?

Note that the data is generated with TeraGen using the same replication
factor with which it is later sorted.

Thank you,
Eitan Rosenfeld

Reply via email to