Wei-Chiu Chuang created HDFS-11260:
--------------------------------------
Summary: Slow writer threads are not stopped
Key: HDFS-11260
URL: https://issues.apache.org/jira/browse/HDFS-11260
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.7.0
Reporter: Wei-Chiu Chuang
When a DataNode receives a transferred block, it tries to stop any existing
writer to the same block. However, this may fail, as shown in the following
error message and stack trace.
Fundamentally, the assumption of {{ReplicaInPipeline#stopWriter}} is wrong. It
assumes the writer thread must be a DataXceiver thread, which can be
interrupted and then terminates. However, an IPC handler thread may also
become the writer thread (by calling {{initReplicaRecovery}}), and such a
thread ignores the interrupt and does not terminate.
{noformat}
2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:135)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2052)
2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor. Cause is
2016-12-16 19:58:56,168 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sj1dra082.corp.adobe.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.10.0.80:44105 dst: /10.10.0.82:50010
java.io.IOException: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
        at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:212)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1579)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:669)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
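
To illustrate the failure mode, here is a minimal standalone sketch (not the
actual Hadoop code) of the interrupt-and-join pattern that
{{ReplicaInPipeline#stopWriter}} relies on, run against a thread that, like an
IPC handler looping on its call queue, swallows the interrupt. The class and
thread names are hypothetical; the join then times out exactly as in the log
above.
{code:java}
import java.io.IOException;

public class StopWriterSketch {

  // Mirrors the stopWriter pattern: interrupt the writer, join with a
  // timeout, and fail if the writer is still alive afterwards.
  static void stopWriter(Thread writer, long timeoutMs) throws IOException {
    if (writer != null && writer != Thread.currentThread() && writer.isAlive()) {
      writer.interrupt();
      try {
        writer.join(timeoutMs);
      } catch (InterruptedException e) {
        throw new IOException("Waiting for writer thread was interrupted.");
      }
      if (writer.isAlive()) {
        throw new IOException("Join on writer thread " + writer + " timed out");
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // A "writer" that behaves like an IPC handler: it catches the interrupt
    // and keeps running, so stopWriter cannot terminate it.
    Thread ipcLikeWriter = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(1000); // stands in for blocking on the call queue
        } catch (InterruptedException e) {
          // Interrupt ignored -- the thread does not terminate.
        }
      }
    }, "ipc-like-writer");
    ipcLikeWriter.setDaemon(true);
    ipcLikeWriter.start();

    try {
      stopWriter(ipcLikeWriter, 2000); // times out, as in the report
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}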
There is also a logic error in {{FsDatasetImpl#createTemporary}}: if the code
in the synchronized block takes more than 60 seconds to execute (in theory),
the method throws an exception without ever trying to stop the existing slow
writer. We saw an {{FsDatasetImpl#createTemporary}} call fail after nearly 10
minutes, and it is not yet clear why.
{noformat}
2016-12-16 23:12:24,636 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Unable to stop existing writer for block BP-1527842723-10.0.0.180-1367984731269:blk_4313782210_1103780331023 after 568320 miniseconds.
{noformat}
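
The following standalone sketch (again not the actual {{FsDatasetImpl}}
source; the helper names and the lookup are hypothetical stand-ins)
paraphrases the control flow described above to show where the ordering goes
wrong: the elapsed-time check runs before {{stopWriter}} is attempted, so a
slow pass through the synchronized section can exhaust the budget and throw
without any stop attempt.
{code:java}
import java.io.IOException;

public class CreateTemporarySketch {

  // Stand-in for dfs.datanode.xceiver.stop.timeout.millis (default 60s).
  static final long WRITER_STOP_TIMEOUT_MS = 60_000;

  // Hypothetical stand-in for a conflicting replica's writer.
  interface ExistingWriter {
    void stopWriter(long timeoutMs) throws IOException;
  }

  // Hypothetical lookup against the volume map; returns null when no
  // conflicting replica exists.
  static ExistingWriter lookUpConflictingWriter() {
    return null;
  }

  static void createTemporary() throws IOException {
    long startTimeMs = System.currentTimeMillis();
    while (true) {
      ExistingWriter existing;
      synchronized (CreateTemporarySketch.class) {
        existing = lookUpConflictingWriter();
        if (existing == null) {
          return; // no conflicting writer: create the temporary replica here
        }
      }
      // The logic error: the timeout is checked BEFORE stopWriter is called,
      // so if the first pass alone exceeds the budget, we throw without ever
      // trying to stop the existing slow writer.
      long elapsedMs = System.currentTimeMillis() - startTimeMs;
      if (elapsedMs > WRITER_STOP_TIMEOUT_MS) {
        throw new IOException(
            "Unable to stop existing writer after " + elapsedMs + " ms.");
      }
      existing.stopWriter(WRITER_STOP_TIMEOUT_MS);
      // Loop back and re-check the volume map.
    }
  }
}
{code}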