The following exception had been showing up in my cluster. I never saw the error message again after setting dfs.datanode.socket.write.timeout = 0 in hdfs-site.xml:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

Fleming Chiu (邱宏明)
707-6128  y_823...@tsmc.com
Eat vegetarian on Mondays and save the planet (Meat Free Monday Taiwan)
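A related knob, not mentioned in this thread, is the read-side timeout. A minimal sketch, assuming the 0.20-era property name dfs.socket.timeout (the read timeout, default 60000 ms; the write timeout above defaults to 480000 ms, the same 8 minutes that shows up in the stack trace below):

<!-- assumption: dfs.socket.timeout is the 0.20-era read-side analogue of
     dfs.datanode.socket.write.timeout; a value of 0 disables it the same way -->
<property>
  <name>dfs.socket.timeout</name>
  <value>0</value>
</property>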
"Geoff Hendrey" <ghend...@decarta.com> wrote on 2010/04/15 11:27 AM:
To: <hbase-user@hadoop.apache.org>
cc: "Paul Mahon" <pma...@decarta.com>, "Bill Brune" <bbr...@decarta.com>, "Shaheen Bahauddin" <sbahaud...@decarta.com>, "Rohit Nigam" <rni...@decarta.com>
Subject: Region server goes away
Please respond to hbase-user

Hi,

I have posted previously about issues I was having with HDFS when I was running HBase and HDFS on the same box, both pseudo-clustered. Now I have two very capable servers. I've set up HDFS with a datanode on each box, the namenode on one box, and ZooKeeper and the HBase master on the other. Both boxes are region servers. I am using Hadoop 0.20.2 and HBase 0.20.3, and I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.

I am running a MapReduce job with about 200 concurrent reducers, each of which writes into HBase with 32,000-row flush buffers. About 40% of the way through the job, HDFS started showing one of the datanodes as dead (the one *not* on the same machine as the namenode). I stopped HBase, and magically the datanode came back to life. Any suggestions on how to increase the robustness?

I see errors like this in the datanode's log:

2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.241.6.80:50010, storageID=DS-642079670-10.241.6.80-50010-1271178858027, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
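The 480000 ms in that trace is the default value of dfs.datanode.socket.write.timeout (8 minutes), which is what makes the property above the relevant knob. To see how often a datanode is hitting it, one rough sketch is to count the timeout signature in its log; the log path here is an assumption, so substitute your own HADOOP_LOG_DIR:

# count write-timeout errors in the datanode log (path is an assumption)
grep -c "timeout while waiting for channel to be ready for write" \
  $HADOOP_HOME/logs/hadoop-*-datanode-*.log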
Here I show the output of 'hadoop dfsadmin -report'. The first time it is invoked, all is well; the second time, one datanode is dead; the third time, the dead datanode has come back to life:

[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1208326105528 (1.1 TB)
DFS Remaining: 1056438108160 (983.88 GB)
DFS Used: 151887997368 (141.46 GB)
DFS Used%: 12.57%
Under replicated blocks: 3479
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 75694104268 (70.5 GB)
Non DFS Used: 35150238004 (32.74 GB)
DFS Remaining: 532889628672(496.29 GB)
DFS Used%: 11.76%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:20:59 PDT 2010

Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 76193893100 (70.96 GB)
Non DFS Used: 33771980052 (31.45 GB)
DFS Remaining: 523548479488(487.59 GB)
DFS Used%: 12.03%
DFS Remaining%: 82.64%
Last contact: Wed Apr 14 11:14:37 PDT 2010

[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 643733970944 (599.52 GB)
Present Capacity: 609294929920 (567.45 GB)
DFS Remaining: 532876144640 (496.28 GB)
DFS Used: 76418785280 (71.17 GB)
DFS Used%: 12.54%
Under replicated blocks: 3247
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (2 total, 1 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 76418785280 (71.17 GB)
Non DFS Used: 34439041024 (32.07 GB)
DFS Remaining: 532876144640(496.28 GB)
DFS Used%: 11.87%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:28:38 PDT 2010

Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Apr 14 11:14:37 PDT 2010

[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1210726427080 (1.1 TB)
DFS Remaining: 1055440003072 (982.96 GB)
DFS Used: 155286424008 (144.62 GB)
DFS Used%: 12.83%
Under replicated blocks: 3338
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 77775145981 (72.43 GB)
Non DFS Used: 33086850051 (30.81 GB)
DFS Remaining: 532871974912(496.28 GB)
DFS Used%: 12.08%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:29:44 PDT 2010

Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 77511278027 (72.19 GB)
Non DFS Used: 33435046453 (31.14 GB)
DFS Remaining: 522568028160(486.68 GB)
DFS Used%: 12.24%
DFS Remaining%: 82.49%
Last contact: Wed Apr 14 11:29:44 PDT 2010
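Since the node flaps between dead and alive, a throwaway watchdog that polls the summary line of 'hadoop dfsadmin -report' can timestamp exactly when the datanode drops out. A sketch, assuming the 0.20-era report format shown above:

#!/bin/sh
# poll the datanode summary once a minute; the "Datanodes available:
# N (N total, M dead)" line format is taken from the reports above
while true; do
  date
  hadoop dfsadmin -report | grep "Datanodes available"
  sleep 60
done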