[
https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449965#comment-15449965
]
Xiao Chen commented on HDFS-6532:
---------------------------------
Thanks Yiqun for working on this. It does look like we can reuse the
{{closeResponder}} method in the loop, but I don't think that's the root cause
here.
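Just to illustrate what I mean by reusing it, here is a rough sketch of the change (paraphrasing the DataStreamer error-handling path from memory, not the exact code):
{code}
// Inside DataStreamer.run(), the responder is currently closed inline
// (response.close(); response.join(); response = null; plus its own
// InterruptedException handling). That inline block could simply call the
// existing helper instead:
if (hasError && response != null) {
  closeResponder();   // closes and joins the responder thread, then nulls it out
}
{code}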
Taking the failure log in the attachment as an example, the test is supposed to end
quickly (around 15:41:58) after the 5th checksum-error failure. But somehow it did
not, and hung there until the 50-second test timeout was reached. After the test
timeout, JUnit interrupts all threads, which is what we see in the last 3 messages
(around 15:42:43).
I looked into this too, and I still think the problem is some error in triggering /
handling the interrupt after the 5th checksum error. I don't have any concrete
progress on it though.
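To show why I suspect the interrupt path: the client thread ends up blocked in roughly the following pattern (a simplified sketch of {{waitForAckedSeqno}}, not the exact code). If the streamer never fails the pending packet or interrupts the waiting thread after the 5th checksum error, nothing wakes this wait up until JUnit's timeout interrupt arrives, which is exactly the InterruptedIOException we see at 15:42:43.
{code}
// Simplified sketch of the wait-for-ack loop the test thread is stuck in
// (names follow DFSOutputStream, but this is a paraphrase, not the real code).
synchronized (dataQueue) {
  while (!isClosed() && lastAckedSeqno < seqno) {
    try {
      // Wakes up only on a notify from the streamer, the periodic re-check,
      // or an interrupt of this thread.
      dataQueue.wait(1000);
    } catch (InterruptedException ie) {
      throw new InterruptedIOException(
          "Interrupted while waiting for data to be acknowledged by pipeline");
    }
  }
}
{code}
The log excerpt I was referring to: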
{noformat}
2016-08-20 15:41:58,084 INFO datanode.DataNode
(DataXceiver.java:writeBlock(835)) - opWriteBlock
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception
java.io.IOException: Terminating due to a checksum error.java.io.IOException:
Unexpected checksum mismatch while writing
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
2016-08-20 15:41:58,084 ERROR datanode.DataNode (DataXceiver.java:run(273)) -
127.0.0.1:52977:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:49059 dst: /127.0.0.1:52977
java.io.IOException: Terminating due to a checksum error.java.io.IOException:
Unexpected checksum mismatch while writing
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:606)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:41:58,258 INFO BlockStateChange
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask
127.0.0.1:51819 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask
127.0.0.1:39731 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask
127.0.0.1:52977 to delete [blk_1073741825_1002]
2016-08-20 15:41:59,235 INFO BlockStateChange (InvalidateBlocks.java:add(116))
- BLOCK* InvalidateBlocks: add blk_1073741825_1001 to 127.0.0.1:49498
2016-08-20 15:41:59,238 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling
blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
for deletion
2016-08-20 15:41:59,240 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:run(295)) - Deleted
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling
blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
for deletion
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:run(295)) - Deleted
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling
blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
for deletion
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:run(295)) - Deleted
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:42:01,259 INFO BlockStateChange
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask
127.0.0.1:49498 to delete [blk_1073741825_1001]
2016-08-20 15:42:02,098 INFO impl.FsDatasetImpl
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:02,232 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling
blk_1073741825_1001 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825
for deletion
2016-08-20 15:42:02,233 INFO impl.FsDatasetAsyncDiskService
(FsDatasetAsyncDiskService.java:run(295)) - Deleted
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1001 file
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825
2016-08-20 15:42:03,051 INFO impl.FsDatasetImpl
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:03,056 INFO impl.FsDatasetImpl
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:25,287 INFO blockmanagement.CacheReplicationMonitor
(CacheReplicationMonitor.java:run(179)) - Rescanning after 30000 milliseconds
2016-08-20 15:42:25,288 INFO blockmanagement.CacheReplicationMonitor
(CacheReplicationMonitor.java:run(202)) - Scanned 0 directive(s) and 0 block(s)
in 1 millisecond(s).
2016-08-20 15:42:43,521 WARN hdfs.DFSClient
(DFSOutputStream.java:closeResponder(865)) - Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:863)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeInternal(DFSOutputStream.java:831)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:827)
2016-08-20 15:42:43,522 INFO datanode.DataNode
(BlockReceiver.java:receiveBlock(935)) - Exception for
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO datanode.DataNode
(DataXceiver.java:writeBlock(835)) - opWriteBlock
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception
java.nio.channels.ClosedByInterruptException
2016-08-20 15:42:43,523 ERROR datanode.DataNode (DataXceiver.java:run(273)) -
127.0.0.1:35247:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:44749 dst: /127.0.0.1:35247
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO hdfs.DFSClient
(TestCrcCorruption.java:testCorruptionDuringWrt(143)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be
acknowledged by pipeline
at
org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2442)
at
org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2420)
at
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2582)
at
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2541)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at
org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
2016-08-20 15:42:43,523 INFO hdfs.MiniDFSCluster
(MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{noformat}
> Intermittent test failure
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
> Key: HDFS-6532
> URL: https://issues.apache.org/jira/browse/HDFS-6532
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.4.0
> Reporter: Yongjun Zhang
> Assignee: Yiqun Lin
> Attachments: HDFS-6532.001.patch,
> TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had
> the following failure. Local rerun is successful
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
> at java.lang.Object.wait(Native Method)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
> at
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode
> (BlockReceiver.java:verifyChunks(404)) - Checksum error in block
> BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from
> /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error:
> DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
> at
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
> at
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
> at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode
> (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum
> error in received data. The error response has been sent upstream.
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
> at java.lang.Thread.run(Thread.java:662)
> ...
> {code}