Hi.

I ran some performance tests (randomwriter/sort) on a small Hadoop
cluster.  The numbers are unexpected.  Some numbers are so far off
that I suspect either I didn't tune all the right knobs, or I had the
wrong expectations.  Hence I am seeking suggestions on how to tune the
cluster and/or explanations of how the underlying parts of Hadoop work.



Here is the setup:
--------------------------
cluster:
 hadoop 0.12.3
 jdk 1.6.0_01
 HDFS file replication factor: 3
 7 machines total
   1 machine for namenode
   1 machine for jobtracker
   5 other machines for slaves (datanodes / mapreduce)


machine spec:
 2GHz dual-core AMD Opteron
 8GB memory
 1 Gbit ethernet
 64-bit Linux 2.6.18-8.1.1.el5

measured disk/network performance:
 ~75MB/sec disk write (tested with "dd if=/dev/zero of=$f ..." to
generate a 20GB file)
 ~76MB/sec disk read (tested with "cat $f > /dev/null" on the 20GB
file generated above)
 ~30MB/sec copying a file from one machine to another (tested with
scp on the 20GB file)

I hope the 20GB size is large enough to wash out the effects of
in-memory file caching and/or network traffic fluctuations.
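
(In case it helps anyone reproduce the numbers, a rough Java
equivalent of the dd-style write test is sketched below.  This is
just a sketch: the output path, the 1MB buffer, and the 20GB size are
arbitrary choices of mine that mirror the dd run, and the OS page
cache can still inflate the result.)

  import java.io.FileOutputStream;
  import java.io.IOException;

  public class DiskWriteTest {
    public static void main(String[] args) throws IOException {
      byte[] buf = new byte[1 << 20];              // 1MB of zeros
      long total = 20L * 1024 * 1024 * 1024;       // 20GB, same as the dd test
      long start = System.currentTimeMillis();
      FileOutputStream out = new FileOutputStream("/data/ddtest.out");
      for (long written = 0; written < total; written += buf.length) {
        out.write(buf);
      }
      out.close();
      long secs = (System.currentTimeMillis() - start) / 1000;
      System.out.println((total / (1024 * 1024)) / secs + " MB/sec write");
    }
  }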


Single file Test
-----------------------
I started this test by reformatting the DFS, then tweaked the random
writer to run only one mapper so that it produces only 1 file.   It
looks like when writing out only one file, the machine that ran the
map gets twice as much data written to its disk as the other
datanodes.   Any idea why this is happening and how to get a more
even distribution?
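
(For reference, the tweak was roughly the fragment below, in the
RandomWriter driver.  I am paraphrasing from memory, and the
"test.randomwriter..." property name is my best recollection -- it
may not match 0.12.3 exactly.)

  // roughly what I set before submitting the job; property name may be off
  JobConf job = new JobConf(RandomWriter.class);
  job.setInt("test.randomwriter.maps_per_host", 1);  // my recollection of the key
  job.setNumMapTasks(1);                             // ask for a single map overall
  JobClient.runJob(job);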

Just making sure: does 3x replication guarantee that copies of the
data will be kept on 3 different machines?  (Hopefully the DFS is not
assigning 2 copies of a data block to that mapper machine.)


Multiple files write
---------------------------
Given that the single-file case, the data is not evenly distributed, I
reformat the DFS and reran random writer for 5GB output with 5 mappers
(i.e., 1GB for each mapper).  This took 294secs with no map/reduce
task failures.    Here are some question on this test:
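
(For this run, the knobs amounted to roughly the following -- again
paraphrasing, and the property names are my best recollection rather
than something I have double-checked against 0.12.3:)

  // 5 slaves x 1 map per host = 5 maps, 1GB each = 5GB of output
  job.setInt("test.randomwriter.maps_per_host", 1);
  job.set("test.randomwrite.bytes_per_map", "1073741824");  // 1GB per map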

- This run used up 25GB of DFS.   I was expecting 3x replication to
mean only 15GB is used.  What is using the other 10GB?    The 25GB
usage is computed from the namenode's reported "capacity" and "used"
percentage, and it agrees with the "remaining" stat.  Just in case I
am reading this wrong, the "Live Datanodes" table's "blocks" column
adds up to 288 blocks.  Is that block size the same as the DFS block
size (which is 64MB in my case)?   If so, this means 18GB worth of
blocks.  That is closer, but the numbers still don't add up (what
happened to the other 3GB of disk space, and why doesn't it match the
"Capacity"/"Remaining"/"Used" stats?).

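(Spelling out my arithmetic, in case I am miscounting:)

  expected usage:  5GB written x 3 replicas      = 15GB
  reported usage:  from "capacity" x "used"%     ~ 25GB
  block count:     288 blocks x 64MB per block   = 18432MB ~= 18GB
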
- Including replication, 15GB gets written.   Given that we have 5
machines/disks writing in parallel, each machine is writing at about
10.4MB/sec, which is about 1/7th of raw disk throughput.   Is this
expected, or are there parameters that I can set to improve this?
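
(The per-machine number comes from:)

  15GB total written / 5 machines = ~3GB per machine
  3GB / 294 sec                   = ~10.4 MB/sec per machine
  75 MB/sec raw disk / 10.4       = ~7x slower than raw disk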


Sorter
----------
I ran the Sorter example on the 5x1GB files generated by
RandomWriter.  It ran with 80 map tasks and 6 reducers, took 1345 sec
to complete, and there were no map/reduce task failures.

Looking more closely, a few questions:
- It ran 10 maps at a time; is there a way to run only 5 maps at a
time?  (Hopefully the scheduler would then be able to schedule 1 map
on each machine, accessing only local data.)
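
(The closest knob I could find is the per-node task limit below -- is
that the right one, and is there a better way?  The property name is
from memory; I believe it is the combined map+reduce per-tasktracker
limit in this version, but I may be wrong about that.)

  // limit each tasktracker to 1 concurrent task (5 nodes -> 5 maps at a time)?
  conf.setInt("mapred.tasktracker.tasks.maximum", 1);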

- A few mappers processed 67MB instead of 64MB.  Why?  (I had thought
the DFS block size was the limit.)

- The fastest map task took 59 sec, and the slowest took 139 sec.
If it were purely local disk access, reading and writing <70MB of data
should have taken only a few secs total.   Any idea why there is such
a large performance discrepancy and how to tune this?   Is there a way
to check what % of tasks are working from local disk vs. from remote
datanodes?
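
(My back-of-the-envelope for "a few secs":)

  read  ~67MB: 67 / 75 MB/sec ~= 0.9 sec local   (67 / 30 ~= 2.2 sec remote)
  write ~67MB: 67 / 75 MB/sec ~= 0.9 sec local
  total:       a few seconds, vs. the 59-139 sec observed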

- The reduce tasks all spent about 760 sec in the shuffle phase.
Since we are sorting 5GB of data on 5 machines, I assume ~4GB gets
transferred over the network to the appropriate reducers (assuming
1/5th of the data stays on the local machine).  So each machine sends
about 820MB of data to the other slaves, and each machine also
receives about 820MB.   Assuming scp-like performance, one machine
distributing its data to the other 4 machines can be done within
~30 sec (27 sec plus some scp startup overhead).   Even if the 5
machines did their copying strictly one after another, the shuffle
could be done in ~150 sec.    This 5x discrepancy is quite a bit
larger than I expected.
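
(The ~150 sec estimate comes from:)

  5GB total, ~1/5 stays local          -> ~4GB over the network
  4GB / 5 machines                     -> ~820MB sent (and received) per machine
  820MB / 30 MB/sec (scp rate)         ~= 27 sec per machine
  5 machines copying one after another ~= 5 x ~30 sec ~= 150 sec
  observed shuffle                      ~ 760 sec, i.e. ~5x slower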

- The reduce's sort phase took 12 sec to 43 sec.   Since each reduce
task only has about 1GB of data, this works out to 24MB/sec to
85MB/sec.  24MB/sec seems reasonable given that even with an
in-memory sort, there is one local disk read and one 3x-replicated
HDFS write.  However, 85MB/sec seems too fast (faster than the raw
local disk speed).
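
(Those rates are just:)

  ~1GB per reduce / 43 sec ~= 24 MB/sec   (slowest sort phase)
  ~1GB per reduce / 12 sec ~= 85 MB/sec   (fastest, vs. ~75 MB/sec raw disk)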

-------------------

Finally, I am new to Java.  Any suggestions on a good profiling tool
to use with Hadoop?  It would be nice if the profiler could help
identify CPU/network/disk bottlenecks.

thanks

bwolen
