[
https://issues.apache.org/jira/browse/HDFS-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaobo Peng updated HDFS-3655:
------------------------------
Attachment: HDFS-3655-0.22-use-join-instead-of-wait.patch
HDFS-3655-0.22.patch keeps the code that waits for the old writer to terminate
in the stopWriter method of class ReplicaInPipeline. The price is that FSDataset
has to be passed to stopWriter so that the monitor can be released while waiting.
HDFS-3655-0.22-use-join-instead-of-wait.patch moves the join/wait code from
ReplicaInPipeline to FSDataset. It looks cleaner, but some FSDataset code has to
be refactored to avoid duplication.
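For readers who have not opened the patches, here is a rough sketch of the idea
behind the second one: look up the old writer while holding the monitor, but
interrupt and join it only after the monitor has been released. This is not code
from either patch. DatasetSketch, touch, startWriter and recoverRbwHoldingLock
are hypothetical names made up for illustration, and the real FSDataset logic is
more involved.

```java
/** Hypothetical stand-in for FSDataset; not the actual HDFS class. */
final class DatasetSketch {
    private Thread writer; // guarded by the DatasetSketch monitor

    /** Stands in for any synchronized call the old writer makes before exiting. */
    synchronized void touch() { }

    synchronized void setWriter(Thread t) { writer = t; }

    /** Hang-prone shape: join() runs while the dataset monitor is held. */
    synchronized boolean recoverRbwHoldingLock(long timeoutMs) {
        return stopAndJoin(writer, timeoutMs);
    }

    /** Safer shape: fetch the writer under the monitor, join outside it. */
    boolean recoverRbw(long timeoutMs) {
        Thread old;
        synchronized (this) {
            old = writer;
            writer = null;
        }
        boolean stopped = stopAndJoin(old, timeoutMs); // monitor not held here
        synchronized (this) {
            // Real code would re-validate the replica state here, because other
            // threads may have run while the monitor was released.
        }
        return stopped;
    }

    /** Interrupts t and waits up to timeoutMs (must be > 0) for it to exit. */
    private static boolean stopAndJoin(Thread t, long timeoutMs) {
        if (t == null || t == Thread.currentThread()) {
            return true;
        }
        t.interrupt();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }

    /** Starts a simulated writer that needs the monitor once more before exiting. */
    static Thread startWriter(DatasetSketch ds) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(3_600_000L); // simulate a writer blocked on I/O
            } catch (InterruptedException e) {
                // interrupted by recovery; fall through to cleanup
            }
            ds.touch(); // blocks for as long as another thread holds the monitor
        });
        t.setDaemon(true);
        t.start();
        ds.setWriter(t);
        return t;
    }

    public static void main(String[] args) {
        DatasetSketch held = new DatasetSketch();
        startWriter(held);
        System.out.println("join under monitor stopped writer: "
                + held.recoverRbwHoldingLock(200));

        DatasetSketch released = new DatasetSketch();
        startWriter(released);
        System.out.println("join outside monitor stopped writer: "
                + released.recoverRbw(5000));
    }
}
```

The sketch uses a timed join only so that the hang-prone variant returns instead
of blocking forever. With an untimed join, as in the stack trace quoted below,
that variant would never return.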
> datenode recoverRbw could hang sometime
> ---------------------------------------
>
> Key: HDFS-3655
> URL: https://issues.apache.org/jira/browse/HDFS-3655
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.22.0, 1.0.3, 2.0.1-alpha
> Reporter: Ming Ma
> Fix For: 0.22.1
>
> Attachments: HDFS-3655-0.22-use-join-instead-of-wait.patch,
> HDFS-3655-0.22.patch
>
>
> This bug seems to apply to 0.22 and Hadoop 2.0. I will upload the initial fix
> done by my colleague Xiaobo Peng shortly (a logistics issue is being worked
> on so that he can upload patches himself later).
> recoverRbw tries to stop the old writer thread while holding the lock (the
> FSDataset monitor) that the old writer thread is waiting for (for example, in
> the call to data.getTmpInputStreams), so neither thread can make progress.
> "DataXceiver for client /10.110.3.43:40193 [Receiving block blk_-3037542385914640638_57111747 client=DFSClient_attempt_201206021424_0001_m_000401_0]" daemon prio=10 tid=0x00007facf8111800 nid=0x6b64 in Object.wait() [0x00007facd1ddb000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1186)
> - locked <0x00000007856c1200> (a org.apache.hadoop.util.Daemon)
> at java.lang.Thread.join(Thread.java:1239)
> at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:158)
> at org.apache.hadoop.hdfs.server.datanode.FSDataset.recoverRbw(FSDataset.java:1347)
> - locked <0x00000007838398c0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:119)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlockInternal(DataXceiver.java:391)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:327)
> at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:405)
> at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:344)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
> at java.lang.Thread.run(Thread.java:662)