OK, so here is a... fun result. I have dfs.replication.min set to 2, so I can't just do

  hadoop fs -Ddfs.replication=1 -put someFile someFile

since that will fail.
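(For reference, the replication settings in play can be double-checked on a client node with something like the following -- the config path is an assumption for a CDH-style /etc/hadoop/conf layout, so adjust as needed.)

  # dfs.replication is only the per-file default; dfs.replication.min is
  # enforced by the namenode, so a put asking for fewer replicas than the
  # minimum is rejected outright.
  grep -B1 -A1 'dfs.replication' /etc/hadoop/conf/hdfs-site.xml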
So here are two results that are fascinating:

$ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
real    1m53.237s
user    0m1.952s
sys     0m0.308s

$ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
real    0m1.689s
user    0m1.763s
sys     0m0.315s

The file is 77MB and so is two blocks. The test with replication level 3 is slow about 9 out of 10 times; when it is slow it is sometimes 28 seconds, sometimes 2 minutes. It was fast one time... The test with replication level 2 is fast in 40 out of 40 tests. This is a development cluster with 8 nodes. It looks like a replication level of 3 or more causes trouble.

Looking more closely at the logs, it seems that certain datanodes (but not all) cause large delays if they are in the middle of an HDFS write chain. So a write that goes A > B > C is fast if B is a good node and C a bad node; if it's A > C > B then it's slow. So I can say that some nodes, but not all, are doing something wrong when they are in the middle of a write chain. If I do a replication = 2 write on one of these bad nodes, it's always slow.

So the good news is I can identify the bad nodes and decommission them. The bad news is this still doesn't make a lot of sense, and 40% of the nodes have the issue. Worse, on a couple of nodes the behavior in the replication = 2 case is not consistent -- sometimes the first block is fast. So it may depend not just on the source, but on the source <> target combination in the chain.

At this point, I suspect something completely broken at the network level, perhaps even routing. Why it would show up after an upgrade is yet to be determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!

-Scott

On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

> Hey Scott,
>
> This is indeed really strange... if you do a straight "hadoop fs -put" with dfs.replication set to 1 from one of the DNs, does it upload slow? That would cut out the network from the equation.
>
> -Todd
>
> On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey <[email protected]> wrote:
>
>> I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow.
>>
>> Some jobs that formerly took 90 seconds now take 20 to 50 minutes.
>>
>> It is an HDFS issue from what I can tell.
>>
>> The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.
>>
>> The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test.
>> top / iostat do not show any significant CPU usage or iowait times on any machine in the cluster during the test.
>> ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.
>>
>> The poorly performing cluster is on a slightly newer CentOS version:
>> Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
>> Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think)
>>
>> The performance is always poor, not sporadically poor. It is poor with M/R tasks as well as non-M/R HDFS clients (i.e. sqoop).
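(For completeness, the per-node check I described at the top of this mail is roughly the loop below -- the hostnames and test file are placeholders for this cluster, and it assumes passwordless ssh to each datanode.)

  for host in dn01 dn02 dn03 dn04 dn05 dn06 dn07 dn08; do
    echo "=== $host ==="
    # Time a replication=2 put originating on each datanode; the first replica
    # lands on the local node, so a consistently slow result points at that
    # node (or its links).
    ssh "$host" 'time hadoop fs -Ddfs.replication=2 -put /tmp/test.tar /tmp/test.tar.probe; hadoop fs -rm /tmp/test.tar.probe' 2>&1 | grep real
  done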
>>
>> Poor performance cluster (no other jobs active during the test):
>> ---------------------------------------------------
>> $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
>> 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 1000000
>> 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
>> 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
>> 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
>> 10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
>> 10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
>> 10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
>> 10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
>> [snip]
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 13:43:07 PDT 2010
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Number of files: 1
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Throughput mb/sec: 0.678296742615553
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: IO rate std deviation: 9.568803140552889E-5
>> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec: 2992.913
>> ------------------------------------
>>
>> Good performance cluster (other jobs active during the test):
>> ---------------------------------
>> hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
>> 10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
>> TestFDSIO.0.0.4
>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 1000000
>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
>> 10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
>> 10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
>> 10/04/16 12:50:55 INFO mapred.JobClient: map 0% reduce 0%
>> 10/04/16 12:51:22 INFO mapred.JobClient: map 100% reduce 0%
>> 10/04/16 12:51:32 INFO mapred.JobClient: map 100% reduce 100%
>> 10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
>> [snip]
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 12:51:32 PDT 2010
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Number of files: 1
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Throughput mb/sec: 92.47699634715865
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: IO rate std deviation: 0.008579127784432736
>> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Test exec time sec: 39.522
>> -------------------------------
>>
>> On the namenode logs during the write, there are about 2 minute intervals between block writes (64MB):
>> 2010-04-16 12:56:41,081 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_5304238373340657846_291044
>> 2010-04-16 12:58:15,144 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.142:50010 is added to blk_5304238373340657846_291044 size 67108864
>> 2010-04-16 12:58:15,145 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.140:50010 is added to blk_5304238373340657846_291044 size 67108864
>> 2010-04-16 12:58:15,147 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_5304238373340657846_291044 size 67108864
>> 2010-04-16 12:58:15,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_-4700875175596279045_291044
>> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.147:50010 is added to blk_-4700875175596279045_291044 size 67108864
>> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.149:50010 is added to blk_-4700875175596279045_291044 size 67108864
>> 2010-04-16 12:58:15,794 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_-4700875175596279045_291044 size 67108864
>> 2010-04-16 12:58:15,795 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0.
>> blk_1059681682487510599_291044
>>
>> On the node where the map task ran, its datanode log has similar chunks of time while it receives the block:
>> 2010-04-16 12:55:08,303 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7985493313278868469_291044 src: /10.3.0.146:48168 dest: /10.3.0.146:50010
>> 2010-04-16 12:56:42,281 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48168, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-7985493313278868469_291044, duration: 93972113000
>> 2010-04-16 12:56:42,282 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-7985493313278868469_291044 terminating
>> 2010-04-16 12:56:42,285 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5304238373340657846_291044 src: /10.3.0.146:48175 dest: /10.3.0.146:50010
>> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48175, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_5304238373340657846_291044, duration: 94059819000
>> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5304238373340657846_291044 terminating
>> 2010-04-16 12:58:16,353 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-4700875175596279045_291044 src: /10.3.0.146:48182 dest: /10.3.0.146:50010
>> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48182, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-4700875175596279045_291044, duration: 637761000
>> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-4700875175596279045_291044 terminating
>>
>> On a different datanode it looks essentially the same but fewer blocks, since all blocks are written to the local node once but only (replication - 1)/(cluster size - 1) times as often elsewhere.
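(Aside on those clienttrace lines: the duration field is in nanoseconds, so 93972113000 ns for a 67108864-byte block works out to roughly 0.68 MB/sec per block -- which matches the TestDFSIO throughput above and the ~94 second allocateBlock-to-addStoredBlock gaps in the namenode log. A rough way to pull per-block write rates out of a datanode log follows; the log path is a guess for this layout, so point it at wherever the datanode log actually lives.)

  # Per-block HDFS_WRITE throughput from datanode clienttrace entries:
  # grab the bytes and duration (ns) fields and print MB/s for each block.
  grep 'clienttrace.*HDFS_WRITE' /var/log/hadoop/*datanode*.log | \
    sed 's/.*bytes: \([0-9]*\),.*duration: \([0-9]*\).*/\1 \2/' | \
    awk '{ printf "%.2f MB/s\n", ($1 / 1048576) / ($2 / 1000000000) }'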
>>
>> I grabbed a few stack traces of the running map task, but I don't think there are any surprises there -- it's waiting on an HDFS write:
>>
>> "DataStreamer for file /benchmarks/TestDFSIO/io_data/test_io_0 block blk_6860895546192904948_291044" daemon prio=10 tid=0x00002aab08244000 nid=0x1031 runnable [0x0000000040c7a000]
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>         - locked <0x00002aaab363dc88> (a sun.nio.ch.Util$1)
>>         - locked <0x00002aaab363dc78> (a java.util.Collections$UnmodifiableSet)
>>         - locked <0x00002aaab363da50> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>         - locked <0x00002aaaf89023c0> (a java.io.BufferedOutputStream)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>
>> "main" prio=10 tid=0x0000000055fb6000 nid=0x1017 waiting for monitor entry [0x000000004112c000]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3006)
>>         - waiting to lock <0x00002aaac36bb440> (a java.util.LinkedList)
>>         - locked <0x00002aaac36bae60> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
>>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
>>
>> I can get some stack traces from the namenode and datanodes during this too. Any other suggestions? Is there anything in a 0.19.2 to 0.20.1+ upgrade that might end up improperly configured and cause this sort of symptom? It's almost like HDFS just decides to sleep for a second between each 64K network transfer. Non-Hadoop network transfer (i.e. scp) seems fine.
>>
>> Another bit of information -- it's not just HDFS writes, the read test is almost equally awful.
>> (1.9MB/sec)
>>
>> The Map task running the read test is in this stack trace:
>>
>> "main" prio=10 tid=0x00000000554b3000 nid=0x3376 runnable [0x0000000040a1d000]
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>         - locked <0x00002aaaf7641740> (a sun.nio.ch.Util$1)
>>         - locked <0x00002aaaf7641730> (a java.util.Collections$UnmodifiableSet)
>>         - locked <0x00002aaaf7641508> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>         - locked <0x00002aaaf76400f8> (a java.io.BufferedInputStream)
>>         at java.io.DataInputStream.read(DataInputStream.java:132)
>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1283)
>>
>> The namenode is not doing anything as far as I can tell in the stack traces. The datanode that is being accessed for data at any given time seems to be in this stack trace:
>>
>> "org.apache.hadoop.hdfs.server.datanode.dataxcei...@7b90a2d" daemon prio=10 tid=0x00002aaaf43b8000 nid=0x79cc runnable [0x000000004433c000]
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>>         - locked <0x00002aaae7ee8440> (a sun.nio.ch.Util$1)
>>         - locked <0x00002aaae7ee8428> (a java.util.Collections$UnmodifiableSet)
>>         - locked <0x00002aaae7ee8098> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>> So the behavior is as if the network is extremely slow, but it seems to only affect Hadoop. Any ideas?
>>
>> Thanks,
>>
>> -Scott
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
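P.S. Since every one of those stack traces bottoms out in a socket read or write, the next thing I plan to try is a raw TCP copy between one of the bad source <> target pairs, outside of Hadoop entirely, and compare it to the HDFS numbers. Roughly like this -- the hostname is a placeholder and nc flag syntax varies between netcat flavors:

  # On the target node of a known-bad pair: listen and throw the data away.
  nc -l 12345 > /dev/null

  # On the source node: push 1GB through the same path the write pipeline
  # uses; dd prints elapsed time and MB/s when it finishes.
  dd if=/dev/zero bs=1M count=1024 | nc target-host 12345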
