RandomWriter/Sort sanity check

Steve Schlosser Wed, 09 May 2007 12:18:30 -0700

Hello all

I've been doing some scaling experiments on our 13-node Hadoop cluster
using the RandomWriter/Sort apps as a benchmark.  To start with, I
modified the benchmark to just write out 1GB of data per node, rather
than the default 10GB, since I don't have a whole lot of disk capacity
at the moment.  I get almost exactly linear scaling for RandomWriter,
from 8MB/s on one node to 107MB/s for 13 nodes.  Good so far.


Interestingly enough, I get super-linear scaling with the Sort program:

(1st column - # nodes)
(2nd column - Sort throughput in MB/s)
(3rd column - Standard deviation over 5 trials)
(4th & 5th columns - # of task and job failures, respectively)

1       0.917   0.012   0       0
2       2.968   0.042   0       0
3       5.530   0.081   0       0
4       7.239   0.103   0       0
5       9.756   0.129   0       0
6       10.590  0.120   0       0
7       15.128  0.392   0       0
8       15.882  0.587   0       0
9       16.466  0.473   0       0
10      17.541  0.183   0       0
11      18.381  0.477   0       0
12      19.419  0.201   0       0
13      31.014  0.880   0       0

12 nodes get me a 21x speedup over 1 node, and 13 nodes get me a 33x
speedup over 1 node.  This seems too good to be true - what could
Hadoop be doing?  For the 13-node runs, there are only ever 13 reduce
tasks, and mapred.tasktracker.tasks.maximum is set to 1.  Can anyone
shed some light?
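
For reference, the speedup figures can be rechecked from the table with a
short script.  This is only a sketch: the throughput values are copied from
the table above, and the names throughput, speedup and efficiency are
illustrative, not part of Hadoop or the benchmark.  Efficiency here means
speedup divided by node count, so anything above 1.0 is super-linear.

```python
# Sort throughput in MB/s, copied from the table above (node count -> MB/s).
throughput = {
    1: 0.917, 2: 2.968, 3: 5.530, 4: 7.239, 5: 9.756, 6: 10.590, 7: 15.128,
    8: 15.882, 9: 16.466, 10: 17.541, 11: 18.381, 12: 19.419, 13: 31.014,
}


def speedup(nodes, base=1):
    """Throughput on `nodes` nodes relative to the `base` node count."""
    return throughput[nodes] / throughput[base]


def efficiency(nodes):
    """Speedup per node; 1.0 is exactly linear, above 1.0 is super-linear."""
    return speedup(nodes) / nodes


if __name__ == "__main__":
    for n in sorted(throughput):
        print(f"{n:2d} nodes: speedup {speedup(n):6.2f}x, "
              f"efficiency {efficiency(n):5.2f}")
```

The ratios come out to roughly 21.2 for 12 nodes and 33.8 for 13 nodes, and
the per-node efficiency is above 1.0 for every multi-node run.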

Just to make sure I understand, Sort itself does nothing but force
Hadoop to partition the input data (1GB per node in my case) and sort
it.  Should I think of the sort as being part of the Map phase or the
Reduce phase?  That is, is there one sort per node?  One sort per Map
task?  One sort per Reduce task?

Thanks.

-steve
