Hi. I ran some performance tests (randomwrite/sort) on a small Hadoop cluster, and the numbers are unexpected. Some are so far off that I suspect I either didn't tune all the right knobs or had the wrong expectations. Hence I am seeking suggestions on how to tune the cluster and/or explanations of how the underlying parts of Hadoop work.
Here is the setup:
--------------------------
cluster:
  hadoop 0.12.3
  jdk 1.6.0_01
  HDFS file replication factor: 3
  7 machines total:
    1 machine for the namenode
    1 machine for the jobtracker
    5 other machines for slaves (datanodes / mapreduce)

machine spec:
  2GHz dual-core AMD Opteron
  8GB memory
  1 Gbit ethernet
  64-bit Linux 2.6.18-8.1.1.el5

measured disk/network performance:
  ~75MB/sec disk write (tested with "dd if=/dev/zero of=$f ..." to generate a 20GB file)
  ~76MB/sec disk read (tested with "cat $f > /dev/null" on the 20GB file generated above)
  ~30MB/sec copying a file from one machine to another (tested with scp on a 20GB file)

I hope the 20GB size is large enough to alleviate the effects of in-memory file caching and/or network traffic fluctuations.

Single file test
-----------------------
I started this test by reformatting the DFS, then tweaked RandomWriter to run only one mapper so that it produces only 1 file. It looks like when writing out only one file, the machine that ran the map gets twice as much data written to its disk as the other datanodes. Any idea why this is happening and how to get a more even distribution? Just to make sure: does 3x replication guarantee that copies of the data are kept on 3 different machines? (Hopefully the DFS is not assigning 2 copies of the data blocks to the mapper machine.)

Multiple files write
---------------------------
Given that the data is not evenly distributed in the single-file case, I reformatted the DFS and reran RandomWriter for 5GB of output with 5 mappers (i.e., 1GB per mapper). This took 294 sec with no map/reduce task failures. Here are some questions on this test:
- This run used up 25GB of DFS. I was expecting 3x replication to mean only 15GB is used. What is using the other 10GB? The 25GB usage is computed from the namenode's "capacity" and "used" percentage, and it agrees with the "remaining" stat. Just in case I am reading this wrong: the "Live Datanodes" table's "blocks" column adds up to 288 blocks. Are those blocks the same size as the DFS block size (which is 64MB in my case)? If so, that is 18GB worth of blocks. This is closer, but the numbers still don't add up: what happened to the other 3GB of disk space, and why doesn't it match the "Capacity"/"Remaining"/"Used" stats?
- Including replication, 15GB gets written. Given that we have 5 machines/disks writing in parallel, each machine is writing at about 10.4MB/sec, which is about 1/7th of raw disk throughput. Is this expected, or are there parameters I can set to improve this?

Sorter
----------
I ran the Sorter example on the 5x1GB files generated by RandomWriter. It ran with 80 map tasks and 6 reducers, took 1345 sec to complete, and there were no map/reduce task failures. Looking more closely, a few questions:
- It ran 10 maps at a time. Is there a way to run only 5 maps at a time (and hopefully the scheduler will then schedule 1 map on each machine, accessing only local data)?
- A few mappers processed 67MB instead of 64MB. Why? (I had thought the DFS block size is the limit.)
- The fastest map task took 59 sec and the slowest took 139 sec. If it were purely local disk access, reading and writing <70MB of data should have taken only a few seconds total. Any idea why there is such a large performance discrepancy, and how to tune this? Is there a way to check what % of tasks are working from local disk vs. from remote datanodes? (See the sketch after this list for how I was thinking of checking block placement myself.)
- The reduce tasks all spent about 760 sec in the shuffle phase (more on this below, after the sketch).
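On checking block placement / data locality myself: here is a minimal sketch of what I was planning to try, assuming the 0.12 FileSystem API has listPaths(), getLength(), and getFileCacheHints() (I am new to this API, so please correct me if these are the wrong calls). It just prints, for each block of each file under a directory (e.g. the RandomWriter output), the datanodes that hold a replica.

  import java.io.IOException;
  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical helper (the class name is mine): print which datanodes hold
  // each block of the files under a given DFS directory, so I can see whether
  // replicas are spread evenly and whether a map's input block was local to
  // the node that ran it.
  public class BlockPlacement {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path dir = new Path(args[0]);            // e.g. the RandomWriter output dir

      for (Path file : fs.listPaths(dir)) {    // assuming listPaths() is the 0.12 call
        long len = fs.getLength(file);
        // As I read the javadoc, getFileCacheHints() returns, for each block in
        // the given byte range, the hostnames of the datanodes holding a replica.
        String[][] hosts = fs.getFileCacheHints(file, 0, len);
        System.out.println(file + " (" + len + " bytes, " + hosts.length + " blocks)");
        for (int b = 0; b < hosts.length; b++) {
          System.out.println("  block " + b + ": " + Arrays.asList(hosts[b]));
        }
      }
    }
  }

If each input block is listed on the node where its map actually ran, the slow maps are probably not a locality problem, and a lopsided host list would also explain the uneven disk usage in the single-file test.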
On the 760-sec shuffle time: since we are sorting 5GB of data on 5 machines, I assume ~4GB gets transferred over the network to the appropriate reducers (assuming 1/5th of the data stays on the local machine). So each machine reads about 820MB of data to send to the other slaves, and each machine also receives about 820MB. Assuming scp-level performance, one machine distributing its data to the other 4 machines can be done within ~30 sec (27 sec plus some scp startup overhead). Even if the copying out is done sequentially, machine by machine, the shuffle should finish in ~150 sec. This 5x discrepancy is quite a bit larger than I expected.

The reduces' sort phase took 12 sec to 43 sec. Since each reduce task only has 1GB of data, this is running at 24MB/sec to 85MB/sec. 24MB/sec seems reasonable given that even with an in-memory sort there is one local disk read and one 3x-replicated HDFS write. However, 85MB/sec seems too fast (faster than raw local disk speed).

-------------------
Finally, I am new to Java. Any suggestions on what's a good profiling tool to use for Hadoop? It would be nice if the profiler could help identify CPU/network/disk bottlenecks.

thanks
bwolen
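PS: on the profiling question, the only thing I know to try so far is the JDK's built-in hprof agent on the task JVMs. Below is a minimal sketch of what I had in mind; it assumes "mapred.child.java.opts" is honored by the child task JVMs in 0.12.3 (I have not verified that for this release), and the hprof output path is just for illustration.

  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical helper (my own naming): turn on hprof CPU sampling for the
  // child JVMs of a job before submitting it.
  public class Profiling {
    public static JobConf withHprof(JobConf conf) {
      // cpu=samples keeps the overhead low; depth controls recorded stack depth.
      conf.set("mapred.child.java.opts",
               "-Xmx512m -agentlib:hprof=cpu=samples,interval=10,depth=8,"
               + "file=/tmp/task_hprof.txt");   // made-up output location
      return conf;
    }
  }

This would only show CPU hotspots inside the map/reduce JVMs, though, so I would still appreciate pointers to tools that can tie CPU, network, and disk activity together.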
