Thanks Andrew. I will set "dfs.datanode.max.xcievers=1024" (default is 256).
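For reference, a minimal sketch of how that could look in each datanode's hadoop-site.xml (1024 is just the value mentioned above, and the description wording is mine, not Hadoop's; the datanodes need a restart to pick it up):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
  <description>Upper bound on the number of DataXceiver threads
  (concurrent block readers/writers) each datanode will run. Raise it
  when heavy HBase write load exhausts the datanode's xceiver threads.
  </description>
</property>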
I am using branch-0.19. Do you think "dfs.datanode.socket.write.timeout=0" is necessary in release-0.19?

Schubert

On Thu, Mar 26, 2009 at 7:57 AM, Andrew Purtell <[email protected]> wrote:
>
> You may need to increase the maximum number of xceivers allowed
> on each of your datanodes.
>
> Best regards,
>
> - Andy
>
> > From: schubert zhang <[email protected]>
> > Subject: Re: Data lost during intensive writes
> > To: [email protected]
> > Date: Wednesday, March 25, 2009, 2:01 AM
> >
> > Hi all,
> > I also meet the same problems/exceptions.
> > I also have 5+1 machines, and the system has been running for about 4 days,
> > and there are 512 regions now. But the two exceptions started to happen earlier.
> >
> > hadoop-0.19
> > hbase-0.19.1 (with patch https://issues.apache.org/jira/browse/HBASE-1008)
> >
> > I want to try to set dfs.datanode.socket.write.timeout=0 and watch it later.
> >
> > Schubert
> >
> > On Sat, Mar 7, 2009 at 3:15 AM, stack <[email protected]> wrote:
> >
> > > On Wed, Mar 4, 2009 at 9:18 AM, <[email protected]> wrote:
> > >
> > > > <property>
> > > >   <name>dfs.replication</name>
> > > >   <value>2</value>
> > > >   <description>Default block replication.
> > > >   The actual number of replications can be specified when the file is created.
> > > >   The default is used if replication is not specified in create time.
> > > >   </description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>dfs.block.size</name>
> > > >   <value>8388608</value>
> > > >   <description>The hbase standard size for new files.</description>
> > > >   <!--<value>67108864</value>-->
> > > >   <!--<description>The default block size for new files.</description>-->
> > > > </property>
> > > >
> > >
> > > The above are non-standard. A replication of 3 might lessen the incidence of
> > > HDFS errors seen, since there will be another replica to go to. Why the
> > > non-standard block size?
> > >
> > > I did not see *dfs.datanode.socket.write.timeout* set to 0. Is that because
> > > you are running w/ 0.19.0? You might try with it, especially because in the
> > > below I see complaint about the timeout (but more below on this).
> > >
> > > > <property>
> > > >   <name>hbase.hstore.blockCache.blockSize</name>
> > > >   <value>65536</value>
> > > >   <description>The size of each block in the block cache.
> > > >   Enable blockcaching on a per column family basis; see the BLOCKCACHE setting
> > > >   in HColumnDescriptor. Blocks are kept in a java Soft Reference cache so are
> > > >   let go when high pressure on memory. Block caching is not enabled by default.
> > > >   Default is 16384.
> > > >   </description>
> > > > </property>
> > >
> > > Are you using blockcaching? If so, 64k was problematic in my testing (OOMEing).
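If block caching is turned on for any column family here, one more thing that might be worth trying, going by the comment above, is dropping the cache block size back to its quoted default. A sketch only, using the property name and default from the snippet above:

<property>
  <name>hbase.hstore.blockCache.blockSize</name>
  <value>16384</value>
  <description>Back to the 16k default; the 65536 value above was
  reported as problematic (OOMEs) in testing.</description>
</property>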
> > > >
> > > > Case 1:
> > > >
> > > > On HBase Regionserver:
> > > >
> > > > 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> > > >     at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> > > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> > > >
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:696)
> > > >     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> > > >     at $Proxy1.addBlock(Unknown Source)
> > > >     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > > >     at $Proxy1.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > > >
> > > > On Hadoop Datanode:
> > > >
> > > > 2009-02-27 04:22:58,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Got exception while serving blk_5465578316105624003_26301 to /10.1.188.249:
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-02-27 04:22:58,110 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > >
> > > Are you sure the regionserver error matches the datanode error?
> > >
> > > My understanding is that in 0.19.0, the DFSClient in the regionserver is
> > > supposed to reestablish timed-out connections. If that is not happening in
> > > your case -- and we've speculated some that there might be holes in this
> > > mechanism -- try with the timeout set to zero (see citation above; be sure
> > > the configuration can be seen by the DFSClient running in hbase, by either
> > > adding it to hbase-site.xml or somehow getting the hadoop-site.xml into the
> > > hbase CLASSPATH (hbase-env.sh#HBASE_CLASSPATH or with a symlink into the
> > > HBASE_HOME/conf dir).
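For what it's worth, a minimal sketch of that client-side setting as it might be added to hbase-site.xml so the DFSClient inside the regionserver actually sees it (the property name and 0 value are the ones discussed in this thread; the description wording is mine):

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
  <description>0 disables the 480000 ms socket write timeout that shows
  up in the datanode logs above. Putting this in hbase-site.xml, or
  getting hadoop-site.xml onto the HBase CLASSPATH, is what lets the
  DFSClient embedded in the regionserver pick it up.</description>
</property>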
> > > >
> > > > Case 2:
> > > >
> > > > HBase Regionserver:
> > > >
> > > > 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for block blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for block blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_8693483996243654850_96912java.io.IOException: Bad response 1 for block blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for block blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_7417692418733608681_96934java.io.IOException: Bad response 1 for block blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6777180223564108728_96939java.io.IOException: Bad response 1 for block blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for block blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 in pipeline 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_SPLIT: metadata_table,r:
> > > > http://com.over-blog.www/_cdata/img/footer_mid....@20070505132942-20070505132942,1235761772185
> > > > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for block blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1004039574836775403_96959java.io.IOException: Bad response 1 for block blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
> > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1004039574836775403_96959 bad datanode[1] 10.1.188.182:50010
> > > >
> > > > Hadoop datanode:
> > > >
> > > > 2009-03-02 09:55:10,201 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-5472632607337755080_96875 1 Exception java.io.EOFException
> > > >     at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > > >     at java.io.DataInputStream.readLong(DataInputStream.java:399)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-5472632607337755080_96875 terminating
> > > > 2009-03-02 09:55:10,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Exception writing block blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
> > > > java.io.IOException: Broken pipe
> > > >     at sun.nio.ch.FileDispatcher.write0(Native Method)
> > > >     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >     at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >     at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >     at java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
> > > > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5472632607337755080_96875 received exception java.io.IOException: Broken pipe
> > > > 2009-03-02 09:55:10,517 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Broken pipe
> > > >     at sun.nio.ch.FileDispatcher.write0(Native Method)
> > > >     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >     at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >     at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >     at java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:11,174 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE, cliID: DFSClient_1091437257, srvID: DS-1180278657-127.0.0.1-50010-1235652659245, blockid: blk_5027345212081735473_96878
> > > > 2009-03-02 09:55:11,177 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5027345212081735473_96878 terminating
> > > > 2009-03-02 09:55:11,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /10.1.188.249:50010
> > > > 2009-03-02 09:55:11,186 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /10.1.188.249:50010
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for
> > > > block blk_8782629414415941143_96845 java.io.IOException: Connection reset by peer
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 Interrupted.
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 terminating
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_8782629414415941143_96845 received exception java.io.IOException: Connection reset by peer
> > > > 2009-03-02 09:55:11,187 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Connection reset by peer
> > > >     at sun.nio.ch.FileDispatcher.read0(Native Method)
> > > >     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > > >     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> > > >     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> > > >     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> > > >     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> > > >     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> > > >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > >     at java.io.DataInputStream.read(DataInputStream.java:132)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > > etc.............................
> > >
> > > This looks like an HDFS issue where it won't move on past the bad server
> > > 182. On the client side, they are reported as WARN in the dfsclient but
> > > don't make it up to the regionserver, so not much we can do about it.
> > >
> > > > I have other exceptions related to DataXceiver problems. These errors
> > > > don't make the region server go down, but I can see that I lost some
> > > > records (about 3.10e6 out of 160.10e6).
> > >
> > > Any regionserver crashes during your upload? I'd think this more the reason
> > > for dataloss; i.e. edits that were in memcache didn't make it out to the
> > > filesystem because there is still no working flush in hdfs -- hopefully 0.21
> > > hadoop...
> > > see HADOOP-4379.... (though your scenario 2 above looks like we could have
> > > handed hdfs the data but it dropped it anyways....)
> > >
> > > > As you can see in my conf files, I upped dfs.datanode.max.xcievers to 8192,
> > > > as suggested in several mails.
> > > > And my ulimit -n is at 32768.
> > >
> > > Make sure the above is for sure in place by looking at the head of your
> > > regionserver log on startup.
> > >
> > > > Do these problems come from my configuration, or my hardware?
> > >
> > > Let's do some more back and forth and make sure we have done all we can as
> > > regards the software configuration. It's probably not hardware, going by the
> > > above.
> > >
> > > Tell us more about your uploading process and your schema. Did it all load?
> > > If so, on your 6 servers, how many regions? How did you verify how much was
> > > loaded?
> > >
> > > St.Ack
> >
