[
https://issues.apache.org/jira/browse/HDFS-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhouyingchao updated HDFS-7999:
-------------------------------
Status: Patch Available (was: Open)
The fix is to call stopWriter without holding the FsDatasetImpl lock. However,
without the lock, another thread may slip in and add another ReplicaInfo to the
map while we are stopping the writer. To handle this, we try to invalidate the
stale replica in a loop. As a last resort, if the thread spends too long in the
loop, we bail out with an IOException.
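
The approach described above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the attached patch: the class
StaleReplicaSketch, the nested Replica type, the datasetLock object and the
timeoutMs parameter are all hypothetical stand-ins for FsDatasetImpl,
ReplicaInPipeline and the dataset lock. The lock is held only while reading or
updating the map; the potentially slow stopWriter call happens outside it, and
the loop re-checks whether another replica was injected in the meantime.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins; not the real FsDatasetImpl or ReplicaInPipeline.
class StaleReplicaSketch {

    static final class Replica {
        private final Thread writer;

        Replica(Thread writer) { this.writer = writer; }

        // Interrupt the writer thread and wait up to timeoutMs for it to exit.
        void stopWriter(long timeoutMs) throws IOException {
            if (writer == null || writer == Thread.currentThread() || !writer.isAlive()) {
                return;
            }
            writer.interrupt();
            try {
                writer.join(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting for writer", e);
            }
            if (writer.isAlive()) {
                throw new IOException("Writer did not stop within " + timeoutMs + " ms");
            }
        }
    }

    private final Object datasetLock = new Object(); // stand-in for the dataset lock
    final Map<Long, Replica> replicaMap = new HashMap<>();

    Replica createTemporary(long blockId, long timeoutMs) throws IOException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        Replica lastStopped = null;
        while (true) {
            Replica current;
            synchronized (datasetLock) {
                current = replicaMap.get(blockId);
                if (current == lastStopped) {
                    // Nothing new slipped in while the lock was released:
                    // replace the stale entry (if any) with the new replica.
                    Replica fresh = new Replica(Thread.currentThread());
                    replicaMap.put(blockId, fresh);
                    return fresh;
                }
            }
            // A replica we have not handled yet is in the map.
            // Stop its writer WITHOUT holding the lock, then re-check.
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                throw new IOException("Timed out invalidating stale replica of block " + blockId);
            }
            if (current != null) {
                current.stopWriter(remainingMs);
            }
            lastStopped = current;
        }
    }

    public static void main(String[] args) throws Exception {
        StaleReplicaSketch ds = new StaleReplicaSketch();
        Thread oldWriter = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        oldWriter.setDaemon(true);
        oldWriter.start();
        ds.replicaMap.put(1L, new Replica(oldWriter));
        Replica fresh = ds.createTemporary(1L, 5_000);
        System.out.println("old writer alive: " + oldWriter.isAlive());
        System.out.println("map holds new replica: " + (ds.replicaMap.get(1L) == fresh));
    }
}
```

With this shape, a thread that only needs the lock briefly (such as the
heartbeat path calling getStorageReports) can acquire it between iterations
instead of waiting for the whole join to finish.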
> DN Heartbeat is blocked by waiting FsDatasetImpl lock
> -----------------------------------------------------
>
> Key: HDFS-7999
> URL: https://issues.apache.org/jira/browse/HDFS-7999
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: zhouyingchao
> Assignee: zhouyingchao
> Attachments: HDFS-7999-001.patch
>
>
> I'm using 2.6.0 and noticed that sometimes the DN's heartbeats were delayed
> for a very long time, say more than 100 seconds. I took a jstack twice, and it
> looks like the heartbeats are all blocked (at getStorageReports) on the
> dataset lock, which is held by a thread that is calling createTemporary, which
> in turn is blocked waiting for the writer of an earlier incarnation to exit.
> The heartbeat thread stack:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
> - waiting to lock <0x00000007b01428c0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
> - locked <0x00000007b0140ed0> (a java.lang.Object)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
> at java.lang.Thread.run(Thread.java:662)
> The DataXceiver thread holds the dataset lock:
> "DataXceiver for client at XXXXX" daemon prio=10 tid=0x00007f14041e6480
> nid=0x52bc in Object.wait() [0x00007f11d78f7000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1194)
> - locked <0x00000007a33b85d8> (a org.apache.hadoop.util.Daemon)
> at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1231)
> - locked <0x00000007b01428c0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:114)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:179)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
> at java.lang.Thread.run(Thread.java:662)