[ https://issues.apache.org/jira/browse/HDFS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200372#comment-13200372 ]
Uma Maheswara Rao G commented on HDFS-2891:
-------------------------------------------

More info: This happens when we test with HBase. Run the HDFS and HBase clusters normally, then abruptly power off the middle DataNode of the write pipeline. The client then gets the following exception:

{noformat}
[2012-01-31 11:15:42,596] [WARN ] [ResponseProcessor for block blk_1327946241860_1109] [org.apache.hadoop.hdfs.DFSClient 3287] DFSOutputStream ResponseProcessor exception for block blk_1327946241860_1109java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/XXX:59179 remote=/FIRST_DATANODE_IP:10010]
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:167)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at java.io.DataInputStream.readFully(DataInputStream.java:178)
 at java.io.DataInputStream.readLong(DataInputStream.java:399)
 at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:131)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3240)
{noformat}

After that the client closed the connection, and the first DataNode got an EOFException and interrupted its threads:

{noformat}
[2012-01-31 11:15:42,597] [INFO ] [org.apache.hadoop.hdfs.server.datanode.DataXceiver@3e909a58] [org.apache.hadoop.hdfs.server.datanode.DataNode 816] Exception in receiveBlock for block blk_1327946241860_1109 java.io.EOFException: while trying to read 31744 bytes
 at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:352)
 at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:399)
 at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:611)
 at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:514)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:138)
 at java.lang.Thread.run(Thread.java:662)
[2012-01-31 11:15:42,598] [INFO ] [PacketResponder 2 for Block blk_1327946241860_1109] [org.apache.hadoop.hdfs.server.datanode.DataNode 1123] PacketResponder blk_1327946241860_1109 2 : Thread is interrupted.
[2012-01-31 11:15:42,598] [INFO ] [PacketResponder 2 for Block blk_1327946241860_1109] [org.apache.hadoop.hdfs.server.datanode.DataNode 1194] PacketResponder 2 for block blk_1327946241860_1109 terminating
{noformat}
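For reference, the 69000 ms value above (and the 66000 ms value discussed below) come from the staggered ack read timeouts along the write pipeline: each stage gets an extra extension per node that is still downstream of it. The snippet below is only a minimal sketch of that arithmetic, assuming branch-1 style constants (a 60 s base read timeout plus a 3 s extension per downstream node, as in HdfsConstants.READ_TIMEOUT / READ_TIMEOUT_EXTENSION); the class name and helper method here are illustrative, not the actual DFSClient/DataXceiver code.

{noformat}
/**
 * Illustrative sketch of the staggered ack read timeouts along an HDFS
 * write pipeline with replication 3. Assumption: 60 s base read timeout
 * plus a 3 s extension per downstream node (branch-1 style constants).
 */
public class PipelineTimeoutSketch {
  static final int READ_TIMEOUT = 60 * 1000;           // assumed base: 60 s
  static final int READ_TIMEOUT_EXTENSION = 3 * 1000;  // assumed extension: 3 s per downstream node

  /** Ack read timeout for a stage that still has 'downstreamNodes' nodes below it. */
  static int ackReadTimeout(int downstreamNodes) {
    return READ_TIMEOUT + READ_TIMEOUT_EXTENSION * downstreamNodes;
  }

  public static void main(String[] args) {
    int replication = 3;
    // Client waits for acks from a pipeline of 3 datanodes: 69000 ms.
    System.out.println("client          -> " + ackReadTimeout(replication));
    // First datanode waits for acks from the 2 datanodes below it: 66000 ms.
    System.out.println("first datanode  -> " + ackReadTimeout(replication - 1));
    // Second datanode waits for acks from the 1 datanode below it: 63000 ms.
    System.out.println("second datanode -> " + ackReadTimeout(replication - 2));
  }
}
{noformat}

With this staggering, the first DataNode is expected to time out on its dead downstream mirror (66000 ms) before the client times out on the first DataNode (69000 ms), which is why the behaviour in the logs below looks wrong.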
Finally, the client marks this first DataNode as bad:

{noformat}
[2012-01-31 11:15:42,597] [WARN ] [DataStreamer for file /hbase/.logs/DDB01,20020,1327946260020/DDB01%3A20020.1327978683479 block blk_1327946241860_1109] [org.apache.hadoop.hdfs.DFSClient 3326] Error Recovery for block blk_1327946241860_1109 bad datanode[0] FIRST_DATANODE_IP:10010
[2012-01-31 11:15:42,597] [WARN ] [DataStreamer for file /hbase/.logs/DDB01,20020,1327946260020/DDB01%3A20020.1327978683479 block blk_1327946241860_1109] [org.apache.hadoop.hdfs.DFSClient 3380] Error Recovery for block blk_1327946241860_1109 in pipeline FIRST_DATANODE_IP:10010, SECOND_DATANODE_IP:10010, THIRD_DATANODE_IP:10010: bad datanode FIRST_DATANODE_IP:10010
[2012-01-31 11:15:46,607] [INFO ] [DataStreamer for file /hbase/.logs/DDB01,20020,1327946260020/DDB01%3A20020.1327978683479 block blk_1327946241860_1109] [org.apache.hadoop.ipc.Client 514] Retrying connect to server: /SECOND_DATANODE_IP:10020. Already tried 0 time(s).
{noformat}

In fact the first DataNode is healthy, but unfortunately, because of these timeouts, the client detected the first DataNode as bad. It then immediately retries with the second DataNode, choosing it as the primary node for recovery. Ideally the first DataNode should time out first, because it waits only 66000 ms for the ack response from the second DataNode (the powered-off one), whereas the client waits 69000 ms. For some reason, however, the client got the timeout exception first, against the first DataNode, so the first DataNode was marked as bad instead of the second one. The impact is that the client retries recovery against the second DataNode, which obviously fails again; it retries 6 times unnecessarily:

{noformat}
2012-01-31 11:19:58,565] [WARN ] [DataStreamer for file /hbase/.logs/DDB01,20020,1327946260020/DDB01%3A20020.1327978683479 block blk_1327946241860_1109] [org.apache.hadoop.hdfs.DFSClient 3426] Error Recovery for block blk_1327946241860_1109 failed because recovery from primary datanode SECOND_DATANODE_IP:10010 failed 6 times. Pipeline was FIRST_DATANODE_IP:10010, SECOND_DATANODE_IP:10010, THIRD_DATANODE_IP:10010. Marking primary datanode as bad.
{noformat}

In the end only one replica is written, on the third DataNode, even though two of the nodes are healthy.

> Sometimes the first DataNode is detected as bad when we power off the second
> DataNode.
> ------------------------------------------------------------------------------------
>
> Key: HDFS-2891
> URL: https://issues.apache.org/jira/browse/HDFS-2891
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, hdfs client
> Affects Versions: 1.1.0
> Reporter: Uma Maheswara Rao G
>
> In one of my clusters, I observed this situation.
> This issue looks to be due to a timeout in the ResponseProcessor at the client side;
> it is marking the first DataNode as bad.
> This happens in the 20.2 version. It can be there in branch-1 as well; will check for trunk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira