[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445223#comment-16445223 ]

Lars Francke commented on HDFS-6532:

Okay...I'm sorry you can ignore my "noise": Turns out this is Spark Speculative Execution killing tasks.

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
> Assignee: Yiqun Lin
> Priority: Major
> Attachments: HDFS-6532.001.patch, HDFS-6532.002.patch, PreCommit-HDFS-Build #16770 test - testCorruptionDuringWrt [Jenkins].pdf, TEST-org.apache.hadoop.hdfs.TestCrcCorruption-select_timeout.xml, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml, jstack
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 5 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 5 milliseconds
>     at java.lang.Object.wait(Native Method)
>     at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
>     at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>     at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
>     at java.lang.Thread.run(Thread.java:662)
> ...
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
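For readers following the thread: a minimal Java sketch of how a task kill can surface as an InterruptedIOException on a writer thread. It assumes the client blocks in a selector wait and checks the thread's interrupt flag afterwards, which is what the exception message in the Spark log suggests. It is not the Hadoop source, and the names InterruptedSelectDemo, selectWithTimeout and runDemo are hypothetical.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptedSelectDemo {

    // Hypothetical helper: block in select(), then turn a pending thread
    // interrupt into an InterruptedIOException (assumed pattern, not Hadoop code).
    static void selectWithTimeout(Selector selector, long timeoutMs) throws IOException {
        long start = System.currentTimeMillis();
        // select() returns early when the calling thread is interrupted;
        // the interrupt status stays set.
        selector.select(timeoutMs);
        if (Thread.currentThread().isInterrupted()) {
            long left = timeoutMs - (System.currentTimeMillis() - start);
            throw new InterruptedIOException(
                "Interrupted while waiting for IO. " + left + " millis timeout left.");
        }
    }

    // Starts a "writer" thread that waits for IO, then interrupts it the way
    // a task kill would, and reports what the writer observed.
    static String runDemo() throws Exception {
        AtomicReference<String> outcome = new AtomicReference<>("no exception");
        Thread writer = new Thread(() -> {
            try (Selector selector = Selector.open()) {
                selectWithTimeout(selector, 10_000);
            } catch (InterruptedIOException e) {
                outcome.set("InterruptedIOException: " + e.getMessage());
            } catch (IOException e) {
                outcome.set("IOException: " + e.getMessage());
            }
        });
        writer.start();
        Thread.sleep(200);      // give the writer time to block in select()
        writer.interrupt();     // stands in for the executor killing the task
        writer.join(5_000);
        return outcome.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo());
    }
}
```

The point of the sketch is only that nothing needs to be wrong on the DataNode side: an interrupt delivered to the writing thread is enough to produce this kind of exception with most of the timeout still unused.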
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445195#comment-16445195 ]

Lars Francke commented on HDFS-6532:

Hi,

I know this is old but we're seeing this same error message on a production cluster and are a bit confused by it as well. Do you happen to have any more information or ideas on the root cause?

This is from Spark writing to HDFS, and Spark is killing tasks with that same exception (see below). Looking at the code I also don't know why things would be interrupted there.

The DataNode logs look normal to me at the same time (unfortunately for those I don't have the verbatim logs):

01:12:26 - Receiving Block
01:13:17 - Thread is interrupted
01:13:17 - Terminating
01:13:17 - Premature EOF from inputStream

{code:java}
18/04/20 01:12:29 INFO Executor: Executor is trying to kill task 66.0 in stage 231.0 (TID 204526)
18/04/20 01:12:29 INFO DFSClient: Exception in createBlockOutputStream
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.194.211.44:52770 remote=/10.194.211.44:1019]. 215000 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2462)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1461)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1380)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:558)
18/04/20 01:12:29 INFO DFSClient: Abandoning BP-188726-10.194.210.65-1478836813700:blk_5197463151_4124323282
18/04/20 01:12:29 WARN Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1094)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1398)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy12.abandonBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:436)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
    at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1384)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:558)
18/04/20 01:12:29 WARN DFSClient: DataStreamer Exception
java.io.IOException: java.lang.InterruptedException
    at org.apache.hadoop.ipc.Client.call(Client.java:1463)
    at org.apache.hadoop.ipc.Client.call(Client.java:1398)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy12.abandonBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:436)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
    at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutput
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550368#comment-15550368 ]

Xiao Chen commented on HDFS-6532:

Thanks for the comment, [~linyiqun].

I took a jstack right before the test timed out. It has:

{noformat}
"ResponseProcessor for block BP-341806944-172.17.0.1-1475696115790:blk_1073741825_1001"
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2247)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)

"PacketResponder: BP-341806944-172.17.0.1-1475696115790:blk_1073741825_1001, type=HAS_DOWNSTREAM_IN_PIPELINE"
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2247)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1303)
    at java.lang.Thread.run(Thread.java:745)

"DataXceiver for client DFSClient_NONMAPREDUCE_27347732_8 at /127.0.0.1:36783 [Receiving block BP-341806944-172.17.0.1-1475696115790:blk_1073741825_1001]"
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:502)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:900)
    at org.apache.hadoop.hdfs.server.datanod
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547303#comment-15547303 ]

Yiqun Lin commented on HDFS-6532:

Thanks [~xiaochen] for the continuous work on this. Some comments from me:

1. {quote}maybe we should just add the test timeout{quote}
I am +0 on increasing the timeout, since I don't think that is the best way to fix this.
2. As the comments said, the socket timeout happens when the test run fails. Is the socket timeout expected here, or is some other exception triggering it?
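To make the timeout question concrete: the stack trace in the issue description shows the test thread parked in Object.wait under waitForAckedSeqno, so close() can only return once an ack arrives or the stream is marked closed. Below is a rough Java sketch of that wait pattern with a deadline added. It is a simplified illustration under those assumptions, not the DFSOutputStream code, and AckWaitDemo, waitForAck, ack, markClosed and runDemo are hypothetical names.

```java
public class AckWaitDemo {
    private long lastAckedSeqno = -1;
    private boolean closed = false;

    // Waits until seqno has been acked, the stream is marked closed, or the
    // deadline passes. Returns true only if the ack actually arrived.
    synchronized boolean waitForAck(long seqno, long maxWaitMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (!closed && lastAckedSeqno < seqno) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // without a deadline, this loop would wait forever
            }
            wait(remaining);
        }
        return lastAckedSeqno >= seqno;
    }

    // Called by the responder side when an ack is received.
    synchronized void ack(long seqno) {
        lastAckedSeqno = Math.max(lastAckedSeqno, seqno);
        notifyAll();
    }

    // Called when the pipeline fails, so that waiters wake up and give up.
    synchronized void markClosed() {
        closed = true;
        notifyAll();
    }

    // Runs two scenarios: a responder that acks, and one that never does.
    static boolean[] runDemo() throws Exception {
        AckWaitDemo healthy = new AckWaitDemo();
        Thread responder = new Thread(() -> healthy.ack(5));
        responder.start();
        boolean acked = healthy.waitForAck(5, 2_000);
        responder.join();

        AckWaitDemo stalled = new AckWaitDemo();
        boolean stalledAcked = stalled.waitForAck(5, 200);
        return new boolean[] {acked, stalledAcked};
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = runDemo();
        System.out.println("acked=" + r[0] + ", stalledAcked=" + r[1]);
    }
}
```

In the stalled case nothing ever calls ack or markClosed, which is the situation a longer test timeout would only hide; the waiter is released by the deadline rather than by the responder.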
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546919#comment-15546919 ]

Xiao Chen commented on HDFS-6532:

Looked more into this.

For failed cases, we see (copied from the 'select-timeout' attachment):
{noformat}
2016-10-04 22:10:24,365 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1114)) - ==
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 28459 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2247)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1015)
{noformat}

And for the success cases, we see:
{noformat}
2016-10-04 15:13:15,271 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ==
java.io.IOException: Bad response ERROR for block BP-1283991366-172.16.3.181-1475619192335:blk_1073741825_1001 from datanode DatanodeInfoWithStorage[127.0.0.1:61321,DS-720243dd-55b6-49ef-ae55-4462e20260d5,DISK]
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1053)
{noformat}
and
{noformat}
2016-10-04 15:13:16,084 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ==
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2249)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)
{noformat}

I printed the exception from [this line|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1149].

So in the failed cases, the responder is running in [this loop|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708], until the following exception is thrown
{noformat}
2016-10-04 22:36:40,403 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(941)) - Exception for BP-2046749708-172.17.0.1-1475620536833:blk_1073741826_1005
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:42956 remote=/127.0.0.1:56324]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:502)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:900)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-10-04 22:36:40,469 INFO hdfs.DFSOutputStream (DFSOutputStream.java:run(1116
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546854#comment-15546854 ]

Hadoop QA commented on HDFS-6532:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 9m 5s | trunk passed |
| +1 | compile | 0m 55s | trunk passed |
| +1 | checkstyle | 0m 33s | trunk passed |
| +1 | mvnsite | 1m 6s | trunk passed |
| +1 | mvneclipse | 0m 16s | trunk passed |
| +1 | findbugs | 1m 55s | trunk passed |
| +1 | javadoc | 1m 5s | trunk passed |
| +1 | mvninstall | 0m 54s | the patch passed |
| +1 | compile | 0m 48s | the patch passed |
| +1 | javac | 0m 48s | the patch passed |
| +1 | checkstyle | 0m 26s | the patch passed |
| +1 | mvnsite | 0m 54s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 55s | the patch passed |
| +1 | javadoc | 1m 4s | the patch passed |
| -1 | unit | 82m 4s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 105m 18s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations |
| | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
| | hadoop.hdfs.server.namenode.TestFileTruncate |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Issue | HDFS-6532 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12831614/HDFS-6532.002.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 44c223b96a4a 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 44f48ee |
| Default Java | 1.8.0_101 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/17007/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/17007/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/17007/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451120#comment-15451120 ]

Xiao Chen commented on HDFS-6532:

Yep, sadly I'm not able to reproduce this locally at all, with either upstream or CDH. The log I attached is from a CDH code base, where I could use [~andrew.wang]'s [dist_test|http://blog.cloudera.com/blog/2016/05/quality-assurance-at-cloudera-distributed-unit-testing/] to reproduce this. (So far dist_test doesn't work with upstream.)

Feel free to attach a failure log here if you're able to get one. Thanks.
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450833#comment-15450833 ]

Yiqun Lin commented on HDFS-6532:
-

Thanks [~xiaochen] for the comment.
{quote}
It does look like we can reuse the closeResponder method in the loop
{quote}
Agreed. I have filed a new JIRA, HDFS-10820, to track that.

I think we are getting closer. I'd like to find more clues in the failure logs, but it seems the attached failure log was based on the old code. Is that right?

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
> Assignee: Yiqun Lin
> Attachments: HDFS-6532.001.patch, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449965#comment-15449965 ]

Xiao Chen commented on HDFS-6532:
-

Thanks Yiqun for working on this.

It does look like we can reuse the {{closeResponder}} method in the loop, but I don't think that's the root cause here.

Taking the failure log in the attachment as an example, the test is supposed to end quickly (around 15:41:58) after the 5th checksum failure. But somehow it did not, and it hangs until the 50-second test timeout is reached. After the timeout, JUnit interrupts all threads, which is what we see in the last 3 messages (around 15:42:43).

I looked into this too, and still think there is some error in triggering or handling the interrupt after the 5th checksum error. I don't have any concrete progress, though.

{noformat}
2016-08-20 15:41:58,084 INFO datanode.DataNode (DataXceiver.java:writeBlock(835)) - opWriteBlock BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
2016-08-20 15:41:58,084 ERROR datanode.DataNode (DataXceiver.java:run(273)) - 127.0.0.1:52977:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:49059 dst: /127.0.0.1:52977
java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:606)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:51819 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:39731 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:52977 to delete [blk_1073741825_1002]
2016-08-20 15:41:59,235 INFO BlockStateChange (InvalidateBlocks.java:add(116)) - BLOCK* InvalidateBlocks: add blk_1073741825_1001 to 127.0.0.1:49498
2016-08-20 15:41:59,238 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,240 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJc
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15448770#comment-15448770 ]

Yiqun Lin commented on HDFS-6532:
-

I looked into this issue again and I may have found the root cause. As [~kihwal] mentioned, the failed case does not print the following message:
{code}
(TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
{code}
That means the program sometimes returns before doing the pipeline recovery. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
  private boolean processDatanodeOrExternalError() throws IOException {
    if (!errorState.hasDatanodeError() && !shouldHandleExternalError()) {
      return false;
    }
    LOG.debug("start process datanode/external error, {}", this);
    // If the response has not closed, this method will just return
    if (response != null) {
      LOG.info("Error Recovery for " + block +
          " waiting for responder to exit. ");
      return true;
    }
    closeStream();
    ...
{code}
I looked into the code and I think a bug causes that. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
  public void run() {
    long lastPacket = Time.monotonicNow();
    TraceScope scope = null;
    while (!streamerClosed && dfsClient.clientRunning) {
      // if the Responder encountered an error, shutdown Responder
      if (errorState.hasError() && response != null) {
        try {
          response.close();
          response.join();
          response = null;
        } catch (InterruptedException e) {
          // If an InterruptedException happens, the response will not be set to null
          LOG.warn("Caught exception", e);
        }
      }
      // A finally block needs to be added here to set response to null
      ...
{code}
I think we should move the line {{response = null;}} into a {{finally}} block. I have attached a patch for this. This test has failed intermittently for a long time; I hope my patch makes sense.

Softly pinging [~xiaochen], [~kihwal] and [~yzhangal] for comments. Thanks.

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
> Attachments: TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
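The proposal in this comment (clear the responder reference in a {{finally}} block) can be illustrated with a small standalone sketch. It is not the attached patch and does not use any Hadoop classes: the class ResponderCleanupSketch, its methods, and the plain Thread standing in for the responder are all hypothetical names made up for illustration. The sketch only shows that an interrupted join() leaves the reference non-null in the first shape and clears it in the second; whether that explains the test hang is a separate question.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; none of these names come from Hadoop.
class ResponderCleanupSketch {
    // Stand-in for the `response` field discussed in the comment.
    private Thread response;

    ResponderCleanupSketch(Thread response) {
        this.response = response;
    }

    boolean hasResponder() {
        return response != null;
    }

    // Shape quoted in the comment: an interrupted join() skips `response = null`.
    void closeResponderWithoutFinally() {
        try {
            response.join();
            response = null;
        } catch (InterruptedException e) {
            System.out.println("Caught exception: " + e);
        }
    }

    // Proposed shape: the reference is cleared on every path.
    void closeResponderWithFinally() {
        try {
            response.join();
        } catch (InterruptedException e) {
            System.out.println("Caught exception: " + e);
        } finally {
            response = null;
        }
    }

    // Runs one variant while the calling thread is interrupted and reports
    // whether the responder reference was cleared afterwards.
    static boolean clearedAfterInterruptedJoin(boolean useFinally) {
        CountDownLatch release = new CountDownLatch(1);
        // The fake responder stays alive until the latch is released.
        Thread responder = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // exit quietly
            }
        });
        responder.setDaemon(true);
        responder.start();

        ResponderCleanupSketch sketch = new ResponderCleanupSketch(responder);
        // Setting the interrupt flag makes the next join() throw InterruptedException.
        Thread.currentThread().interrupt();
        if (useFinally) {
            sketch.closeResponderWithFinally();
        } else {
            sketch.closeResponderWithoutFinally();
        }
        release.countDown();
        return !sketch.hasResponder();
    }

    public static void main(String[] args) {
        System.out.println("without finally, cleared: " + clearedAfterInterruptedJoin(false));
        System.out.println("with finally, cleared: " + clearedAfterInterruptedJoin(true));
    }
}
```

In the first variant the reference survives the interrupt, which is the state in which the quoted {{processDatanodeOrExternalError}} keeps returning early at its {{response != null}} check.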
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413411#comment-15413411 ]

Yiqun Lin commented on HDFS-6532:
-

I looked into the logs from the runs where this test failed; both showed these stack traces:
{code}
BP-1186421078-172.17.0.2-1470312073795:blk_1073741826_1006] WARN hdfs.DataStreamer (DataStreamer.java:closeResponder(873)) - Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1245)
    at java.lang.Thread.join(Thread.java:1319)
    at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:871)
    at org.apache.hadoop.hdfs.DataStreamer.closeInternal(DataStreamer.java:733)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:729)
2016-08-04 12:02:02,523 [Thread-0] INFO hdfs.DFSClient (TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
    at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:775)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:697)
    at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:778)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:755)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
    at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:136)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}
But I am not yet sure why the InterruptedException happens intermittently. In addition, I found that if an InterruptedException is thrown while the program is in {{dataQueue.wait()}}, the files are not completely closed in {{DFSClient#closeAllFilesBeingWritten}}. That issue is tracked by HDFS-10549. I think these two issues are related.

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
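The second trace in this comment shows a monitor wait being interrupted and surfacing as an InterruptedIOException. A minimal sketch of that pattern, assuming nothing about the real DataStreamer beyond what the trace and the {{dataQueue.wait()}} remark show, could look like the following. The class AckWaitSketch and its fields are hypothetical; only the exception message is copied from the log above.

```java
import java.io.InterruptedIOException;

// Hypothetical illustration only; this is not the Hadoop implementation.
class AckWaitSketch {
    private final Object dataQueue = new Object();
    private long lastAckedSeqno = -1;

    // Blocks until the given sequence number is acknowledged. An interrupt
    // during the wait is converted into an InterruptedIOException.
    void waitForAckedSeqno(long seqno) throws InterruptedIOException {
        synchronized (dataQueue) {
            while (lastAckedSeqno < seqno) {
                try {
                    dataQueue.wait(1000);
                } catch (InterruptedException e) {
                    throw new InterruptedIOException(
                        "Interrupted while waiting for data to be acknowledged by pipeline");
                }
            }
        }
    }

    // Simulates the test thread being interrupted (as a test timeout would do)
    // while no acknowledgement ever arrives, and returns what the caller sees.
    static String runAndInterrupt() {
        AckWaitSketch sketch = new AckWaitSketch();
        // With the flag already set, the first wait() throws immediately.
        Thread.currentThread().interrupt();
        try {
            sketch.waitForAckedSeqno(0);
            return "returned normally";
        } catch (InterruptedIOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndInterrupt());
    }
}
```

The point of the sketch is that the caller only sees the exception once something interrupts the waiting thread; if no acknowledgement and no interrupt arrive, the loop keeps waiting.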
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408077#comment-15408077 ]

Kihwal Lee commented on HDFS-6532:
--

Still sometimes failing in trunk the same way. When it is working, {{close()}} should fail.

{noformat}
2016-08-04 11:10:38,293 [Thread-0] INFO hdfs.DFSClient (TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1128)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:552)
{noformat}

In the failed case, pipeline recovery only happened twice. {{DataStreamer}} usually notices a problem directly, or {{ResponseProcessor}} hints at it. It looks like the datanode thread tried to terminate, but the connection was not closed.

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
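The expected behavior described in this comment (the write gives up after repeated pipeline recovery, so {{close()}} fails) can be sketched as a bounded retry counter. This is an assumption-laden illustration, not the DataStreamer logic: the class PipelineRecoveryLimitSketch, its counter, and the constant are hypothetical, and the limit of 5 is taken only from the log message quoted above.

```java
import java.io.IOException;

// Hypothetical illustration only; names and structure are not from Hadoop.
class PipelineRecoveryLimitSketch {
    // Limit taken from the log message in the comment.
    static final int MAX_RECOVERIES = 5;

    private int recoveryCount = 0;

    // Called each time the same write hits another error and needs recovery.
    void recoverPipeline() throws IOException {
        recoveryCount++;
        if (recoveryCount > MAX_RECOVERIES) {
            throw new IOException("Failing write. Tried pipeline recovery "
                + MAX_RECOVERIES + " times without success.");
        }
    }

    // Simulates a write whose data is corrupted on every attempt, so every
    // recovery is followed by another error until the limit is exceeded.
    static String writeWithPersistentChecksumError() {
        PipelineRecoveryLimitSketch sketch = new PipelineRecoveryLimitSketch();
        try {
            while (true) {
                sketch.recoverPipeline();
            }
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(writeWithPersistentChecksumError());
    }
}
```

In the failing runs described above, the equivalent of this loop stopped after two recoveries without either succeeding or throwing, which is why the test thread was still waiting when the timeout fired.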
[jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195362#comment-15195362 ]

Kihwal Lee commented on HDFS-6532:
--

Still happening.

{noformat}
testCorruptionDuringWrt(org.apache.hadoop.hdfs.TestCrcCorruption) Time elapsed: 50.284 sec <<< ERROR!
java.lang.Exception: test timed out after 5 milliseconds
    at java.lang.Object.wait(Native Method)
    at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:764)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:689)
    at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:770)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:747)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
    at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:136)
{noformat}

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> --
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang