How much memory did you allocate to the regionservers?

Cheers
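For reference, regionserver heap is controlled by HBASE_HEAPSIZE in conf/hbase-env.sh (the 0.20 default was 1000 MB, which is easy to exhaust with 200 reducers writing large flush buffers). A sketch, assuming the stock hbase-env.sh layout — the 4096 is purely illustrative, not a tuned recommendation for your hardware:

```shell
# conf/hbase-env.sh on each regionserver (value is in MB; restart HBase after changing)
# 4096 is an example figure only -- size it to what the box can spare after HDFS and MR.
export HBASE_HEAPSIZE=4096
```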
On Wed, Apr 14, 2010 at 8:27 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
> Hi,
>
> I have posted previously about issues I was having with HDFS when I was
> running HBase and HDFS on the same box, both pseudo-clustered. Now I have
> two very capable servers. I've set up HDFS with a datanode on each box,
> the namenode on one box, and ZooKeeper and the HBase master on the other
> box. Both boxes are regionservers. I am using Hadoop 0.20.2 and HBase
> 0.20.3.
>
> I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
>
> I am running a MapReduce job with about 200 concurrent reducers, each of
> which writes into HBase with 32,000-row flush buffers. About 40% of the
> way through the job, HDFS started showing one of the datanodes as dead
> (the one *not* on the same machine as the namenode). I stopped HBase,
> and magically the datanode came back to life.
>
> Any suggestions on how to increase the robustness?
>
> I see errors like this in the datanode's log:
>
> 2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.241.6.80:50010,
> storageID=DS-642079670-10.241.6.80-50010-1271178858027,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch : java.nio.channels.SocketChannel[connected
> local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.ja
>
> Here is the output of 'hadoop dfsadmin -report'. The first time it is
> invoked, all is well. The second time, one datanode is dead. The third
> time, the dead datanode has come back to life:
>
> [had...@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 1277248323584 (1.16 TB)
> Present Capacity: 1208326105528 (1.1 TB)
> DFS Remaining: 1056438108160 (983.88 GB)
> DFS Used: 151887997368 (141.46 GB)
> DFS Used%: 12.57%
> Under replicated blocks: 3479
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 2 (2 total, 0 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 75694104268 (70.5 GB)
> Non DFS Used: 35150238004 (32.74 GB)
> DFS Remaining: 532889628672 (496.29 GB)
> DFS Used%: 11.76%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:20:59 PDT 2010
>
> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 633514352640 (590.01 GB)
> DFS Used: 76193893100 (70.96 GB)
> Non DFS Used: 33771980052 (31.45 GB)
> DFS Remaining: 523548479488 (487.59 GB)
> DFS Used%: 12.03%
> DFS Remaining%: 82.64%
> Last contact: Wed Apr 14 11:14:37 PDT 2010
>
>
> [had...@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 643733970944 (599.52 GB)
> Present Capacity: 609294929920 (567.45 GB)
> DFS Remaining: 532876144640 (496.28 GB)
> DFS Used: 76418785280 (71.17 GB)
> DFS Used%: 12.54%
> Under replicated blocks: 3247
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 1 (2 total, 1 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 76418785280 (71.17 GB)
> Non DFS Used: 34439041024 (32.07 GB)
> DFS Remaining: 532876144640 (496.28 GB)
> DFS Used%: 11.87%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:28:38 PDT 2010
>
> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 0 (0 KB)
> DFS Used: 0 (0 KB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 0 (0 KB)
> DFS Used%: 100%
> DFS Remaining%: 0%
> Last contact: Wed Apr 14 11:14:37 PDT 2010
>
>
> [had...@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 1277248323584 (1.16 TB)
> Present Capacity: 1210726427080 (1.1 TB)
> DFS Remaining: 1055440003072 (982.96 GB)
> DFS Used: 155286424008 (144.62 GB)
> DFS Used%: 12.83%
> Under replicated blocks: 3338
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 2 (2 total, 0 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 77775145981 (72.43 GB)
> Non DFS Used: 33086850051 (30.81 GB)
> DFS Remaining: 532871974912 (496.28 GB)
> DFS Used%: 12.08%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:29:44 PDT 2010
>
> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 633514352640 (590.01 GB)
> DFS Used: 77511278027 (72.19 GB)
> Non DFS Used: 33435046453 (31.14 GB)
> DFS Remaining: 522568028160 (486.68 GB)
> DFS Used%: 12.24%
> DFS Remaining%: 82.49%
> Last contact: Wed Apr 14 11:29:44 PDT 2010
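One thing worth checking: hbase-site.xml is only read by HBase processes, so dfs.datanode.socket.write.timeout set there changes the DFS client behavior inside HBase, not the datanode itself. The 480000 ms in your stack trace is the datanode's own write timeout, which lives in the datanode's hdfs-site.xml. With 200 reducers writing at once, exhausting the datanode's xceiver threads is also a common way for a node to look dead under load and revive once HBase stops. A sketch of the datanode-side settings (values are illustrative, not tuned recommendations — and a datanode restart is needed):

```xml
<!-- hdfs-site.xml on each datanode -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value> <!-- 0 disables the write timeout -->
</property>
<property>
  <!-- note the historical misspelling in 0.20: "xcievers", not "xceivers" -->
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
```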