On 28/09/11 22:45, Sameer Farooqui wrote:
Hi everyone,

I'm looking for some recommendations for how to get our Hadoop cluster to do
faster I/O.

Currently, our lab cluster is 8 worker nodes and 1 master node (with
NameNode and JobTracker).

Each worker node has:
- 48 GB RAM
- 16 processors (Intel Xeon E5630 @ 2.53 GHz)
- 1 Gb Ethernet connection


Due to company policy, we have to keep the HDFS storage on a disk array. Our
SAS-connected array is capable of 6 Gb/s (768 MB/s) for each of the 8 hosts.
So, theoretically, we should be able to get a max of about 6 GB/s of
simultaneous reads across the 8 nodes if we benchmark it.

You're missing the point of Hadoop there. You will end up getting the bandwidth of the HDD most likely to fail next, HDFS copy replication on top of RAID is overkill, and you will reach limits on scale, both technical (SAN scalability) and financial.


Our disk array is presenting each of the 8 nodes with a 21 TB LUN. The LUN
is RAID-5 across 12 disks on the array. That LUN is partitioned on the
server into 6 different devices like this:




The file system type is ext3.

Set noatime on those mounts.


So, when we run TestDFSIO, here are the results:

*++ Write ++*
hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -write -nrFiles 80 -fileSize 10000

11/09/27 18:54:53 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
11/09/27 18:54:53 INFO fs.TestDFSIO:            Date & time: Tue Sep 27 18:54:53 EDT 2011
11/09/27 18:54:53 INFO fs.TestDFSIO:        Number of files: 80
11/09/27 18:54:53 INFO fs.TestDFSIO: Total MBytes processed: 800000
11/09/27 18:54:53 INFO fs.TestDFSIO:      Throughput mb/sec: 8.2742240008678
11/09/27 18:54:53 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.288116455078125
11/09/27 18:54:53 INFO fs.TestDFSIO:  IO rate std deviation: 0.3435565217052116
11/09/27 18:54:53 INFO fs.TestDFSIO:     Test exec time sec: 1427.856

So, throughput across all 8 nodes is 8.27 * 80 = 661 MB per second.
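As a rough cross-check on that arithmetic, here is a minimal Python sketch that derives two aggregate estimates from the TestDFSIO output above. It assumes (my reading of the benchmark, not something stated in the post) that "Throughput mb/sec" is a per-task figure, so multiplying by the number of files is only right if all 80 tasks ran concurrently. The second estimate divides the total data by the reported exec time, which also includes job startup. The function name is made up for illustration.

```python
def aggregate_estimates(per_task_mb_s, n_files, total_mb, exec_time_s):
    """Return two rough cluster-wide throughput estimates in MB/s.

    optimistic: per-task rate times number of files; assumes every task
                ran at the same time for the whole test.
    wall_clock: total data divided by the reported test exec time;
                includes job startup and any task queueing.
    """
    optimistic = per_task_mb_s * n_files
    wall_clock = total_mb / exec_time_s
    return optimistic, wall_clock


# Figures copied from the SAN-backed write run above.
san_write_optimistic, san_write_wall_clock = aggregate_estimates(
    per_task_mb_s=8.2742240008678,
    n_files=80,
    total_mb=800000,
    exec_time_s=1427.856,
)

print(f"per-task x files : {san_write_optimistic:.0f} MB/s")
print(f"total / exec time: {san_write_wall_clock:.0f} MB/s")
```

The real sustained rate probably sits somewhere between the two numbers, so the 661 MB/s figure is best read as an upper estimate.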


*++ Read ++*
hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -read -nrFiles 80 -fileSize 10000

11/09/27 19:43:12 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
11/09/27 19:43:12 INFO fs.TestDFSIO:            Date & time: Tue Sep 27 19:43:12 EDT 2011
11/09/27 19:43:12 INFO fs.TestDFSIO:        Number of files: 80
11/09/27 19:43:12 INFO fs.TestDFSIO: Total MBytes processed: 800000
11/09/27 19:43:12 INFO fs.TestDFSIO:      Throughput mb/sec: 5.854318503905489
11/09/27 19:43:12 INFO fs.TestDFSIO: Average IO rate mb/sec: 5.96372652053833
11/09/27 19:43:12 INFO fs.TestDFSIO:  IO rate std deviation: 0.9885505979030621
11/09/27 19:43:12 INFO fs.TestDFSIO:     Test exec time sec: 2055.465


So, throughput across all 8 nodes is 5.85 * 80 = 468 MB per second.


*Question 1:* Why are the reads and writes so much slower than expected? Any
suggestions about what can be changed? I understand that RAID-5 backed disks
are an unorthodox configuration for HDFS, but has anybody successfully done
this? If so, what kind of results did you see?




Also, we detached the 8 nodes from the disk array and connected each of them
to 6 local hard drives for testing (w/ ext4 file system). Then we ran the
same read TestDFSIO and saw this:

11/09/26 20:24:09 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
11/09/26 20:24:09 INFO fs.TestDFSIO:            Date & time: Mon Sep 26 20:24:09 EDT 2011
11/09/26 20:24:09 INFO fs.TestDFSIO:        Number of files: 80
11/09/26 20:24:09 INFO fs.TestDFSIO: Total MBytes processed: 800000
11/09/26 20:24:09 INFO fs.TestDFSIO:      Throughput mb/sec: 13.065623285187982
11/09/26 20:24:09 INFO fs.TestDFSIO: Average IO rate mb/sec: 15.160531997680664
11/09/26 20:24:09 INFO fs.TestDFSIO:  IO rate std deviation: 8.000530562022949
11/09/26 20:24:09 INFO fs.TestDFSIO:     Test exec time sec: 1123.447


So, with local disks, reads are about 1 GB per second across the 8 nodes.
Much faster!
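To put the two read runs side by side per node, a small Python sketch follows. It reuses the "per-task rate times 80 files" estimate from the post, so it inherits the assumption that all 80 tasks ran concurrently; the 768 MB/s link figure is the one quoted for the array, and the helper name is made up for illustration.

```python
NODES = 8
FILES = 80
DISKS_PER_NODE = 6       # local-disk test setup from the post
SAS_LINK_MB_S = 768      # per-host array bandwidth quoted in the post


def per_node_read_mb_s(per_task_mb_s, files=FILES, nodes=NODES):
    """Estimated read rate per worker node, assuming all tasks overlap."""
    return per_task_mb_s * files / nodes


san_per_node = per_node_read_mb_s(5.854318503905489)      # array-backed run
local_per_node = per_node_read_mb_s(13.065623285187982)   # local-disk run

san_link_fraction = san_per_node / SAS_LINK_MB_S
local_per_disk = local_per_node / DISKS_PER_NODE
speedup = local_per_node / san_per_node

print(f"SAN:   {san_per_node:.1f} MB/s per node "
      f"({san_link_fraction:.1%} of the quoted SAS link)")
print(f"Local: {local_per_node:.1f} MB/s per node "
      f"({local_per_disk:.1f} MB/s per disk)")
print(f"Local disks are about {speedup:.1f}x faster on reads")
```

On these estimates the array-backed run uses well under a tenth of the quoted per-host link, so the link itself is unlikely to be what limits the reads.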

Much lower cost per TB too. Orders of magnitude lower.


With 6 local disks, writes performed the same though:

11/09/26 19:49:58 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
11/09/26 19:49:58 INFO fs.TestDFSIO:            Date & time: Mon Sep 26 19:49:58 EDT 2011
11/09/26 19:49:58 INFO fs.TestDFSIO:        Number of files: 80
11/09/26 19:49:58 INFO fs.TestDFSIO: Total MBytes processed: 800000
11/09/26 19:49:58 INFO fs.TestDFSIO:      Throughput mb/sec: 8.573949802610528
11/09/26 19:49:58 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.588902473449707
11/09/26 19:49:58 INFO fs.TestDFSIO:  IO rate std deviation: 0.3639466752546032
11/09/26 19:49:58 INFO fs.TestDFSIO:     Test exec time sec: 1383.734


Write throughput across the cluster was 685 MB per second.

Writes get streamed to multiple HDFS nodes for redundancy, so you pay the bandwidth and network overhead, and you write 3x the data.
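A back-of-envelope Python sketch of what that replication costs is below. The assumptions are mine and are not confirmed anywhere in the thread: a replication factor of 3, the first replica landing on the writing node so only the other two cross the network, and a 1 Gb Ethernet link carrying roughly 125 MB/s in each direction. The function names are hypothetical.

```python
NODES = 8
GBE_LINK_MB_S = 125   # assumed usable rate of 1 Gb Ethernet, per direction


def replication_load(client_mb_s, replication=3, first_replica_local=True):
    """Return (disk MB/s, network MB/s) generated cluster-wide by client writes."""
    disk = client_mb_s * replication
    remote_copies = replication - 1 if first_replica_local else replication
    network = client_mb_s * remote_copies
    return disk, network


def network_bound_ceiling(nodes, link_mb_s, replication=3):
    """Client write rate at which every node's link is full (first replica local)."""
    return nodes * link_mb_s / (replication - 1)


# 685 MB/s is the cluster-wide write estimate quoted in the post.
disk_mb_s, network_mb_s = replication_load(685)
network_per_node = network_mb_s / NODES
ceiling = network_bound_ceiling(NODES, GBE_LINK_MB_S)

print(f"disk writes       : {disk_mb_s} MB/s cluster-wide")
print(f"network traffic   : {network_mb_s} MB/s cluster-wide, "
      f"{network_per_node:.0f} MB/s per node")
print(f"network-bound max : {ceiling:.0f} MB/s of client writes")
```

Under these assumptions the implied per-node network load is above what a single 1 GbE link carries, which suggests either that the 685 MB/s estimate is optimistic or that the real replication settings differ. Either way it fits the network, rather than the disks, limiting writes in both setups.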


Options
- Stop using HDFS on the SAN; it's the wrong approach. Mount the SAN directly and use file:// URLs, and let the SAN do the networking and redundancy.
- Buy some local HDDs, at least for all the temp data: logs, overspill, mapreduce.tmp.dir. You don't need redundancy here.
