[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546919#comment-15546919 ]
Xiao Chen commented on HDFS-6532:
---------------------------------

Looked more into this.

For failed cases, we see (copied from the 'select-timeout' attachment):
{noformat}
2016-10-04 22:10:24,365 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1114)) - ====== java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 28459 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2247)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1015)
{noformat}

And for the success cases, we see:
{noformat}
2016-10-04 15:13:15,271 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ====== java.io.IOException: Bad response ERROR for block BP-1283991366-172.16.3.181-1475619192335:blk_1073741825_1001 from datanode DatanodeInfoWithStorage[127.0.0.1:61321,DS-720243dd-55b6-49ef-ae55-4462e20260d5,DISK]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1053)
{noformat}
and
{noformat}
2016-10-04 15:13:16,084 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ====== java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2249)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)
{noformat}

I printed the exception from [this line|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1149].

So in the failed cases, the responder is running in [this loop|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708], until the following exception is thrown:
{noformat}
2016-10-04 22:36:40,403 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(941)) - Exception for BP-2046749708-172.17.0.1-1475620536833:blk_1073741826_1005
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:42956 remote=/127.0.0.1:56324]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:502)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:900)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-10-04 22:36:40,469 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ====== java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2249)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)
{noformat}
Afterwards the {{close}} call correctly returns and the test passes.

Not sure how we can interrupt early in this case. Since there's no impact on correctness, maybe we should just add the test timeout.

[~kihwal], could you share your thoughts on this? Thanks a lot.
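
For reference, the failed-case trace shows the responder leaving a selector wait on an interrupt with 28459 ms of its timeout still unused. The sketch below reproduces only that general behaviour with plain {{java.nio}}. It is not Hadoop code: the class {{SelectInterruptSketch}}, the method {{blockThenInterrupt}}, and the 60000/200 ms values are hypothetical names and numbers chosen for illustration. It assumes a positive select timeout, since {{select(0)}} blocks indefinitely.

```java
import java.io.IOException;
import java.nio.channels.Selector;

/** Illustration only: a thread blocked in Selector.select(timeout) wakes early when interrupted. */
final class SelectInterruptSketch {

    /** Outcome of one blocked select() call. */
    static final class Result {
        final long elapsedMillis;   // how long select() actually blocked
        final boolean interrupted;  // interrupt status seen right after select() returned

        Result(long elapsedMillis, boolean interrupted) {
            this.elapsedMillis = elapsedMillis;
            this.interrupted = interrupted;
        }
    }

    /**
     * Starts a thread that blocks in select(selectTimeoutMillis) on a selector with no
     * registered channels, interrupts it after interruptAfterMillis, and reports what
     * the blocked thread observed.
     */
    static Result blockThenInterrupt(long selectTimeoutMillis, long interruptAfterMillis)
            throws IOException, InterruptedException {
        final long[] elapsed = new long[1];
        final boolean[] interrupted = new boolean[1];
        try (Selector selector = Selector.open()) {
            Thread responder = new Thread(() -> {
                long start = System.nanoTime();
                try {
                    // Nothing is registered, so only the timeout, a wakeup, or an
                    // interrupt can end this call.
                    selector.select(selectTimeoutMillis);
                } catch (IOException e) {
                    // Not expected in this sketch; the flags below are still recorded.
                }
                // select() leaves the interrupt status set when it returns because of an
                // interrupt; a caller can check it to tell an interrupt from a timeout.
                interrupted[0] = Thread.currentThread().isInterrupted();
                elapsed[0] = (System.nanoTime() - start) / 1_000_000L;
            }, "responder-sketch");
            responder.start();
            Thread.sleep(interruptAfterMillis);
            responder.interrupt();
            responder.join(); // join() makes the writes above visible to this thread
        }
        return new Result(elapsed[0], interrupted[0]);
    }

    public static void main(String[] args) throws Exception {
        Result r = blockThenInterrupt(60_000L, 200L);
        System.out.println("select() returned after " + r.elapsedMillis
                + " ms, interrupted=" + r.interrupted);
    }
}
```

In this sketch the blocked thread returns shortly after the interrupt rather than after the full timeout. This only illustrates the client-side wait; the 60000 ms {{SocketTimeoutException}} in the datanode log is a separate read on the datanode side and is not modelled here.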
> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
> Assignee: Yiqun Lin
> Attachments: HDFS-6532.001.patch, HDFS-6532.002.patch, PreCommit-HDFS-Build #16770 test - testCorruptionDuringWrt [Jenkins].pdf, TEST-org.apache.hadoop.hdfs.TestCrcCorruption-select_timeout.xml, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
>     at java.lang.Object.wait(Native Method)
>     at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
>     at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>     at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
>     at java.lang.Thread.run(Thread.java:662)
> ...
> {code}