I have two clusters upgraded to CDH2. One is performing fine, and the other
is EXTREMELY slow.
Some jobs that formerly took 90 seconds now take 20 to 50 minutes.
It is an HDFS issue from what I can tell.
The simple TestDFSIO benchmark with one map task shows the problem clearly. I have
looked at every difference I can find and am wondering where else to look to
track this down.
The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum
with a 'dd' write test.
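For reference, the disk check was nothing fancy -- just a streaming dd write
against each data disk, roughly like the following, where the mount point and
size are examples only and the trailing sync keeps page cache from inflating
the number:

$ time ( dd if=/dev/zero of=/data/1/ddtest bs=1M count=2000 && sync )
$ rm /data/1/ddtest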
top / iostat do not show any significant CPU usage or iowait times on any
machines in the cluster during the test.
ifconfig does not report any dropped packets or other errors on any machine in
the cluster. dmesg has nothing interesting.
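Those checks were just the usual tools run by hand on each node while the
benchmark ran; roughly the following, with the interface name and interval as
examples:

$ iostat -x 5
$ top
$ ifconfig eth0 | grep -E 'errors|dropped'
$ dmesg | tail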
The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64
GNU/Linux (CentOS 5.3, I think)
The performance is always poor, not sporadically poor. It is poor with M/R
tasks as well as with non-M/R HDFS clients (e.g. Sqoop).
Poor performance cluster (no other jobs active during the test):
---------------------------------------------------
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 13:43:07 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat: Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat: Throughput mb/sec: 0.678296742615553
10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
10/04/16 13:43:07 INFO mapred.FileInputFormat: IO rate std deviation: 9.568803140552889E-5
10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec: 2992.913
------------------------------------
Good performance cluster (other jobs active during the test):
---------------------------------
hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
TestFDSIO.0.0.4
10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
10/04/16 12:50:55 INFO mapred.JobClient: map 0% reduce 0%
10/04/16 12:51:22 INFO mapred.JobClient: map 100% reduce 0%
10/04/16 12:51:32 INFO mapred.JobClient: map 100% reduce 100%
10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
[snip]
10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/16 12:51:32 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 12:51:32 PDT 2010
10/04/16 12:51:32 INFO mapred.FileInputFormat: Number of files: 1
10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 12:51:32 INFO mapred.FileInputFormat: Throughput mb/sec: 92.47699634715865
10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
10/04/16 12:51:32 INFO mapred.FileInputFormat: IO rate std deviation: 0.008579127784432736
10/04/16 12:51:32 INFO mapred.FileInputFormat: Test exec time sec: 39.522
-------------------------------
In the namenode log during the write, there are roughly two-minute intervals
between 64MB block writes:
2010-04-16 12:56:41,081 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_5304238373340657846_291044
2010-04-16 12:58:15,144 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.142:50010 is added to blk_5304238373340657846_291044 size 67108864
2010-04-16 12:58:15,145 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.140:50010 is added to blk_5304238373340657846_291044 size 67108864
2010-04-16 12:58:15,147 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_5304238373340657846_291044 size 67108864
2010-04-16 12:58:15,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_-4700875175596279045_291044
2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.147:50010 is added to blk_-4700875175596279045_291044 size 67108864
2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.149:50010 is added to blk_-4700875175596279045_291044 size 67108864
2010-04-16 12:58:15,794 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_-4700875175596279045_291044 size 67108864
2010-04-16 12:58:15,795 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_1059681682487510599_291044
On the node where the map task ran, the datanode log shows similar stretches of
time while it receives each block:
2010-04-16 12:55:08,303 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7985493313278868469_291044 src: /10.3.0.146:48168 dest: /10.3.0.146:50010
2010-04-16 12:56:42,281 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48168, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-7985493313278868469_291044, duration: 93972113000
2010-04-16 12:56:42,282 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-7985493313278868469_291044 terminating
2010-04-16 12:56:42,285 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5304238373340657846_291044 src: /10.3.0.146:48175 dest: /10.3.0.146:50010
2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48175, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_5304238373340657846_291044, duration: 94059819000
2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5304238373340657846_291044 terminating
2010-04-16 12:58:16,353 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-4700875175596279045_291044 src: /10.3.0.146:48182 dest: /10.3.0.146:50010
2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48182, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-4700875175596279045_291044, duration: 637761000
2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-4700875175596279045_291044 terminating
A different datanode's log looks essentially the same, just with fewer blocks,
since every block is written once to the local node, while each other node only
receives about (replication - 1)/(cluster size - 1) of the blocks.
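(As a concrete example: with replication 3, as in the log above, each remote
datanode sees roughly 2/(N-1) of the blocks on an N-datanode cluster -- about
22% if N were 10; that node count is just for illustration.)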
I grabbed a few stack traces of the running map task, but I don't think there
are any surprises there -- it's waiting on an HDFS write:
"DataStreamer for file /benchmarks/TestDFSIO/io_data/test_io_0 block
blk_6860895546192904948_291044" daemon prio=10 tid=0x00002aab08244000
nid=0x1031 runnable [0x0000000040c7a000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00002aaab363dc88> (a sun.nio.ch.Util$1)
- locked <0x00002aaab363dc78> (a java.util.Collections$UnmodifiableSet)
- locked <0x00002aaab363da50> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
- locked <0x00002aaaf89023c0> (a java.io.BufferedOutputStream)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
"main" prio=10 tid=0x0000000055fb6000 nid=0x1017 waiting for monitor entry
[0x000000004112c000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3006)
- waiting to lock <0x00002aaac36bb440> (a java.util.LinkedList)
- locked <0x00002aaac36bae60> (a
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
at
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
I can get some stack traces from the namenode and datanodes during this too.
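For anyone who wants to pull the same kind of dump, jstack against the relevant
JVM works; the jps step below is just one way to find the pid:

$ jps -l          # locate the map task / DataNode / NameNode JVM
$ jstack <pid>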
Any other suggestions? Is there anything in a 0.19.2 to 0.20.1+ upgrade that
might end up improperly configured and cause this sort of symptom? It's almost
as if HDFS just decides to sleep for a second between each 64K network
transfer. Non-Hadoop network transfers (e.g. scp) seem fine.
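That check is just an scp of a large file between two of the nodes, watching
the rate scp reports; the file name and target host below are only examples:

$ scp /tmp/bigfile 10.3.0.140:/tmp/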
Another bit of information: it's not just HDFS writes; the read test is almost
equally awful (1.9MB/sec).
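For reference, the read pass is the same jar invoked with -read, run after the
write so the benchmark data already exists under /benchmarks/TestDFSIO:

$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -read -nrFiles 1 -fileSize 2000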
The Map task running the read test is in this stack trace:
"main" prio=10 tid=0x00000000554b3000 nid=0x3376 runnable [0x0000000040a1d000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00002aaaf7641740> (a sun.nio.ch.Util$1)
- locked <0x00002aaaf7641730> (a java.util.Collections$UnmodifiableSet)
- locked <0x00002aaaf7641508> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00002aaaf76400f8> (a java.io.BufferedInputStream)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1283)
The namenode is not doing anything, as far as I can tell from its stack traces.
The datanode being accessed for data at any given time seems to be sitting in
this stack trace:
"org.apache.hadoop.hdfs.server.datanode.dataxcei...@7b90a2d" daemon prio=10
tid=0x00002aaaf43b8000 nid=0x79cc runnable [0x000000004433c000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00002aaae7ee8440> (a sun.nio.ch.Util$1)
- locked <0x00002aaae7ee8428> (a java.util.Collections$UnmodifiableSet)
- locked <0x00002aaae7ee8098> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
at
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
at
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
at
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
at java.lang.Thread.run(Thread.java:619)
So the behavior is as if the network is extremely slow, but it seems to only
affect Hadoop. Any ideas?
Thanks,
-Scott