[ https://issues.apache.org/jira/browse/HDFS-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated HDFS-11260:
------------------------------
Target Version/s: 3.0.0-beta1 (was: 2.8.0, 3.0.0-beta1)
> Slow writer threads are not stopped
> -----------------------------------
>
> Key: HDFS-11260
> URL: https://issues.apache.org/jira/browse/HDFS-11260
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.0
> Environment: CDH5.8.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
>
> If a DataNode receives a transferred block, it tries to stop the existing
> writer to the same block. However, this may not work, and we saw the
> following error message and stack trace.
> Fundamentally, the assumption of {{ReplicaInPipeline#stopWriter}} is wrong.
> It assumes the writer thread must be a DataXceiver thread, which can be
> interrupted and terminates afterwards. However, an IPC thread may also be
> the writer thread by calling initReplicaRecovery, and it ignores the
> interrupt and does not terminate.
> {noformat}
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:135)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2052)
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor. Cause is
> 2016-12-16 19:58:56,168 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sj1dra082.corp.adobe.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.10.0.80:44105 dst: /10.10.0.82:50010
> java.io.IOException: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
> at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:212)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1579)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:669)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
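
The interrupt-and-join behaviour described above can be reproduced outside
Hadoop. The following is a minimal sketch, not the DataNode code: the class
and method names are hypothetical, cooperativeWriter stands in for a
DataXceiver-style thread that exits on interrupt, and handlerStyleWriter
stands in for an IPC handler that polls a call queue and swallows the
interrupt.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class StopWriterSketch {
    /** Interrupts the writer, waits up to timeoutMs, and reports whether it terminated. */
    static boolean interruptAndJoin(Thread writer, long timeoutMs) throws InterruptedException {
        writer.interrupt();
        writer.join(timeoutMs);
        return !writer.isAlive();
    }

    /** Hypothetical stand-in for a DataXceiver-style writer: exits when interrupted. */
    static Thread cooperativeWriter() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // Treat the interrupt as a request to stop and fall out of run().
            }
        }, "cooperative-writer");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Hypothetical stand-in for an IPC handler: polls a queue and ignores interrupts. */
    static Thread handlerStyleWriter() {
        LinkedBlockingQueue<Runnable> calls = new LinkedBlockingQueue<>();
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    Runnable call = calls.poll(1, TimeUnit.SECONDS);
                    if (call != null) {
                        call.run();
                    }
                } catch (InterruptedException e) {
                    // Interrupt is swallowed; the handler keeps serving its queue.
                }
            }
        }, "handler-style-writer");
        t.setDaemon(true); // lets the JVM exit even though this thread never stops
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cooperative stopped: "
            + interruptAndJoin(cooperativeWriter(), 500));
        System.out.println("handler-style stopped: "
            + interruptAndJoin(handlerStyleWriter(), 500));
    }
}
```

With these stand-ins, the first call is expected to report that the thread
stopped and the second that it did not, which corresponds to the "Join on
writer thread ... timed out" warning in the log above.
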
> There is also a logic error in FsDatasetImpl#createTemporary: if the code in
> the synchronized block executes for more than 60 seconds (in theory), it
> could throw an exception without trying to stop the existing slow writer.
> We saw an FsDatasetImpl#createTemporary call fail after nearly 10 minutes,
> and it is not yet clear why. My understanding is that the code intends to
> stop slow writers after 1 minute by default. Some code rewrite is probably
> needed to get the logic right.
> {noformat}
> 2016-12-16 23:12:24,636 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Unable to stop existing writer for block BP-1527842723-10.0.0.180-1367984731269:blk_4313782210_1103780331023 after 568320 miniseconds.
> {noformat}
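
One possible shape for the ordering described above is to attempt the stop
first and check a single overall deadline afterwards, so that the stop is
always tried at least once and the total wait stays bounded. This is only a
sketch under assumptions, not the actual FsDatasetImpl code or a proposed
patch: the Writer interface, stopWithinDeadline, and the choice of
System.nanoTime are all hypothetical, and Writer#stop is assumed to block for
up to the timeout it is given.

```java
import java.io.IOException;

public class DeadlineStopSketch {
    /** Hypothetical writer handle; not a Hadoop interface. */
    interface Writer {
        /** Blocks for up to timeoutMs trying to stop the writer; true if it stopped. */
        boolean stop(long timeoutMs) throws InterruptedException;
    }

    /**
     * Tries to stop the writer before giving up. Every attempt receives only
     * the time left in the overall budget, and the budget is checked after
     * the attempt, so at least one attempt is always made.
     */
    static void stopWithinDeadline(Writer writer, long totalTimeoutMs)
            throws IOException, InterruptedException {
        long startNs = System.nanoTime();
        long remainingMs = totalTimeoutMs;
        long elapsedMs;
        do {
            if (writer.stop(remainingMs)) {
                return; // writer stopped; the caller can proceed
            }
            elapsedMs = (System.nanoTime() - startNs) / 1_000_000;
            remainingMs = totalTimeoutMs - elapsedMs;
        } while (remainingMs > 0);
        throw new IOException(
            "Unable to stop existing writer after " + elapsedMs + " ms");
    }

    public static void main(String[] args) throws Exception {
        // A writer that stops immediately: returns without throwing.
        stopWithinDeadline(timeoutMs -> true, 1000);
        // A writer that never stops: the call gives up once the budget is spent.
        try {
            stopWithinDeadline(timeoutMs -> { Thread.sleep(timeoutMs); return false; }, 200);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether a bounded retry like this explains or avoids the 10-minute delay
reported above is not established here; the sketch only illustrates trying
the stop before bailing out.
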