I find that if I set "dfs.datanode.socket.write.timeout=0", Hadoop will always create a new socket. Is that OK?
On Wed, Mar 25, 2009 at 5:01 PM, schubert zhang <[email protected]> wrote:
> Hi all,
> I am also seeing the same problems/exceptions. I also have 5+1 machines, and the
> system has been running for about 4 days; there are 512 regions now. But the two
> exceptions started to happen earlier.
>
> hadoop-0.19
> hbase-0.19.1 (with patch https://issues.apache.org/jira/browse/HBASE-1008)
>
> I want to try setting dfs.datanode.socket.write.timeout=0 and will watch it later.
>
> Schubert
>
>
> On Sat, Mar 7, 2009 at 3:15 AM, stack <[email protected]> wrote:
>
>> On Wed, Mar 4, 2009 at 9:18 AM, <[email protected]> wrote:
>>
>> > <property>
>> >   <name>dfs.replication</name>
>> >   <value>2</value>
>> >   <description>Default block replication.
>> >   The actual number of replications can be specified when the file is created.
>> >   The default is used if replication is not specified in create time.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>dfs.block.size</name>
>> >   <value>8388608</value>
>> >   <description>The hbase standard size for new files.</description>
>> >   <!--<value>67108864</value>-->
>> >   <!--<description>The default block size for new files.</description>-->
>> > </property>
>>
>> The above are non-standard. A replication of 3 might lessen the incidence of the
>> HDFS errors seen, since there would be another replica to go to. Why the
>> non-standard block size?
>>
>> I did not see *dfs.datanode.socket.write.timeout* set to 0. Is that because you
>> are running with 0.19.0? You might try it, especially because in the logs below I
>> see a complaint about the timeout (but more on this below).
>>
>> > <property>
>> >   <name>hbase.hstore.blockCache.blockSize</name>
>> >   <value>65536</value>
>> >   <description>The size of each block in the block cache.
>> >   Enable blockcaching on a per column family basis; see the BLOCKCACHE setting
>> >   in HColumnDescriptor. Blocks are kept in a java Soft Reference cache so are
>> >   let go when high pressure on memory. Block caching is not enabled by default.
>> >   Default is 16384.
>> >   </description>
>> > </property>
>>
>> Are you using blockcaching? If so, 64k was problematic in my testing (OOMEing).
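As a point of comparison, here is a minimal sketch of the stock values stack is alluding to (replication factor 3, the default 64 MB HDFS block size that is commented out above, and the 16k block-cache block size the description gives as the default); treat it as an illustration rather than a recommendation for this particular cluster:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <!-- default replication; a third replica gives HDFS somewhere else to go on error -->
  </property>

  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <!-- the stock 64 MB block size -->
  </property>

  <property>
    <name>hbase.hstore.blockCache.blockSize</name>
    <value>16384</value>
    <!-- the default block-cache block size; 65536 was OOME-prone in stack's testing -->
  </property>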
>>
>> > Case 1:
>> >
>> > On HBase Regionserver:
>> >
>> > 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
>> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
>> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>> >         at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>> >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>> >
>> >         at org.apache.hadoop.ipc.Client.call(Client.java:696)
>> >         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>> >         at $Proxy1.addBlock(Unknown Source)
>> >         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>> >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>> >         at $Proxy1.addBlock(Unknown Source)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> >
>> > On Hadoop Datanode:
>> >
>> > 2009-02-27 04:22:58,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Got exception while serving blk_5465578316105624003_26301 to /10.1.188.249:
>> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
>> >         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>> >         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-02-27 04:22:58,110 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
>> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
>> >         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>> >         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>> >         at java.lang.Thread.run(Thread.java:619)
>>
>> Are you sure the regionserver error matches the datanode error?
>>
>> My understanding is that in 0.19.0, the DFSClient in the regionserver is supposed to
>> reestablish timed-out connections. If that is not happening in your case -- and we
>> have speculated that there might be holes in this mechanism -- try with the timeout
>> set to zero (see citation above; be sure the configuration can be seen by the
>> DFSClient running in hbase, by either adding it to hbase-site.xml or getting
>> hadoop-site.xml onto the hbase CLASSPATH (hbase-env.sh#HBASE_CLASSPATH or a symlink
>> into the HBASE_HOME/conf dir)).
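To make that concrete, here is a minimal sketch of the property stack describes, placed in hbase-site.xml so that the DFSClient embedded in the regionserver actually sees it (a value of 0 disables the datanode socket write timeout; putting it in a hadoop-site.xml on the HBase CLASSPATH instead works the same way):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
    <!-- 0 disables the 480000 ms write timeout seen in the datanode log above -->
  </property>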
>>
>> > Case 2:
>> >
>> > HBase Regionserver:
>> >
>> > 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-6496095407839777264_96895
>> > java.io.IOException: Bad response 1 for block blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 bad datanode[1] 10.1.188.182:50010
>> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-7585241287138805906_96914
>> > java.io.IOException: Bad response 1 for block blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 bad datanode[1] 10.1.188.182:50010
>> > 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_8693483996243654850_96912
>> > java.io.IOException: Bad response 1 for block blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 bad datanode[1] 10.1.188.182:50010
>> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8939308025013258259_96931
>> > java.io.IOException: Bad response 1 for block blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 bad datanode[1] 10.1.188.182:50010
>> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_7417692418733608681_96934
>> > java.io.IOException: Bad response 1 for block blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 bad datanode[2] 10.1.188.182:50010
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6777180223564108728_96939
>> > java.io.IOException: Bad response 1 for block blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 bad datanode[1] 10.1.188.182:50010
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-6352908575431276531_96948
>> > java.io.IOException: Bad response 1 for block blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 bad datanode[2] 10.1.188.182:50010
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 in pipeline 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:15,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_SPLIT: metadata_table,r:http://com.over-blog.www/_cdata/img/footer_mid....@20070505132942-20070505132942,1235761772185
>> >
>> > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-1071965721931053111_96956
>> > java.io.IOException: Bad response 1 for block blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 bad datanode[2] 10.1.188.182:50010
>> > 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
>> >
>> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1004039574836775403_96959
>> > java.io.IOException: Bad response 1 for block blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1004039574836775403_96959 bad datanode[1] 10.1.188.182:50010
>> >
>> > Hadoop datanode:
>> >
>> > 2009-03-02 09:55:10,201 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-5472632607337755080_96875 1 Exception java.io.EOFException
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-03-02 09:55:10,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-5472632607337755080_96875 terminating
>> > 2009-03-02 09:55:10,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Exception writing block blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
>> > java.io.IOException: Broken pipe
>> >         at sun.nio.ch.FileDispatcher.write0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>> >         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>> >         at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>> >         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>> >         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> >         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> >         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> >         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
>> > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5472632607337755080_96875 received exception java.io.IOException: Broken pipe
>> > 2009-03-02 09:55:10,517 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
>> > java.io.IOException: Broken pipe
>> >         at sun.nio.ch.FileDispatcher.write0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>> >         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>> >         at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>> >         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>> >         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> >         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> >         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> >         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-03-02 09:55:11,174 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE, cliID: DFSClient_1091437257, srvID: DS-1180278657-127.0.0.1-50010-1235652659245, blockid: blk_5027345212081735473_96878
>> > 2009-03-02 09:55:11,177 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5027345212081735473_96878 terminating
>> > 2009-03-02 09:55:11,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /10.1.188.249:50010
>> > 2009-03-02 09:55:11,186 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /10.1.188.249:50010
>> > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_8782629414415941143_96845 java.io.IOException: Connection reset by peer
>> > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 Interrupted.
>> > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 terminating
>> > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_8782629414415941143_96845 received exception java.io.IOException: Connection reset by peer
>> > 2009-03-02 09:55:11,187 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
>> > java.io.IOException: Connection reset by peer
>> >         at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>> >         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> >         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >         at java.io.DataInputStream.read(DataInputStream.java:132)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
>> >         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >         at java.lang.Thread.run(Thread.java:619)
>> > etc. ...
>>
>> This looks like an HDFS issue where it won't move on past the bad server 182. On the
>> client side, these are reported as WARNs in the dfsclient but don't make it up to the
>> regionserver, so there is not much we can do about it.
>>
>> > I have other exceptions related to DataXceiver problems. These errors don't make
>> > the region server go down, but I can see that I lost some records (about 3.10e6
>> > out of 160.10e6).
>>
>> Any regionserver crashes during your upload? I'd think that a more likely reason for
>> data loss; i.e. edits that were in memcache didn't make it out to the filesystem
>> because there is still no working flush in hdfs -- hopefully in 0.21 hadoop... see
>> HADOOP-4379 (though your scenario 2 above looks like we could have handed hdfs the
>> data and it dropped it anyway).
>>
>> > As you can see in my conf files, I upped dfs.datanode.max.xcievers to 8192, as
>> > suggested in several mails. And my ulimit -n is 32768.
>>
>> Make sure the above is actually in place by looking at the head of your regionserver
>> log on startup.
>>
>> > Do these problems come from my configuration, or my hardware?
>>
>> Let's do some more back and forth and make sure we have done all we can as regards
>> the software configuration. Going by the above, it is probably not hardware.
>>
>> Tell us more about your uploading process and your schema. Did it all load? If so,
>> across your 6 servers, how many regions? How did you verify how much was loaded?
>>
>> St.Ack
>
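For completeness, the xceiver limit mentioned above would look something like this; it is only a sketch, and the assumption that it lives in the datanodes' hadoop-site.xml (alongside a matching ulimit -n for the datanode user) is mine, not something stated in the thread:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>8192</value>
    <!-- value the original poster reports using; note the property name really is spelled "xcievers" -->
  </property>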
