Understanding Network Utilization of TeraSort

Eitan Rosenfeld Fri, 09 Jan 2015 18:34:10 -0800

My goal is to see how the performance and network utilization of TeraSort
is affected by varying the replication factor from 1-3 on my 16-node
cluster. (I have modified TeraSort such that it uses my system's
replication factor.) I am sorting 100GB.


In particular, I am confused by the network utilization. With 1 replica,
the network utilization is under 1GB. With 2 replicas, it is about 117GB.
And with 3 replicas, it is about 225-230GB.

I understand that just replicating the 100GB of sorted data causes 100GB
and 200GB of network traffic in the 2 and 3 replica configurations,
respectively. However, what accounts for the extra 17GB and 25-30GB in the
2 and 3 replica configs? And what accounts for the minimal network usage in
the 1 replica configuration?

Note that the data is generated with TeraGen using the same replication
factor with which it is later sorted.

Thank you,
Eitan Rosenfeld

Understanding Network Utilization of TeraSort

Reply via email to