[
https://issues.apache.org/jira/browse/HDFS-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391504#comment-14391504
]
Colin Patrick McCabe edited comment on HDFS-7999 at 4/1/15 9:45 PM:
--------------------------------------------------------------------
Thanks for the explanation, [~sinago]. The patch makes sense.
{code}
1455     // Hang too long, just bail out. This is not supposed to happen.
1456     if (Time.monotonicNow() - startTime > bailOutDuration) {
1457       break;
1458     }
{code}
Can you throw an exception from here rather than breaking? It seems like that
would be clearer. Also, please log a WARN message to explain that there has
been a problem. I would prefer to see a log message rather than a comment
explaining that "this is not supposed to happen".
Instead of naming the timeout period "bailOutDuration", how about something
like "writerStopTimeoutMs"? In general, timeouts that are in milliseconds
should have names ending in "Ms".
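For illustration, the two suggestions above could be combined roughly as in the
sketch below. This is not the actual patch; the "writerStopTimeoutMs" and "LOG"
names are assumptions about the surrounding method, and the exception message
wording is discussed further down.
{code}
// Sketch only: log a WARN and throw instead of silently breaking out of
// the wait loop once the timeout (in milliseconds) has elapsed.
if (Time.monotonicNow() - startTime > writerStopTimeoutMs) {
  LOG.warn("createTemporary: unable to stop the previous writer after "
      + writerStopTimeoutMs + " ms.");
  throw new IOException("Unable to stop the previous writer after "
      + writerStopTimeoutMs + " ms.");
}
{code}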
{code}
1464     throw new IOException("Hang " + ((Time.monotonicNow() - startTime) / 1000)
1465         + " seconds in createTemporary, just bail out");
{code}
This error message seems confusing. It should be something like "Unable to
stop existing writer for $REPLICA after $WHATEVER milliseconds."
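For example (again only a sketch; "replicaInfo" is an assumed name for the
existing replica whose writer is being stopped):
{code}
// Sketch of the suggested wording: name the replica and report elapsed time.
throw new IOException("Unable to stop existing writer for " + replicaInfo
    + " after " + (Time.monotonicNow() - startTime) + " milliseconds.");
{code}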
I think it looks good aside from that.
bq. [~xinwei] wrote: Make the heartbeat lockless can avoid the happening of
dead DataNode, and I think it is a necessary
I think it is a good idea to make the heartbeat lockless. However, it is an
exaggeration to say that it is necessary. The heartbeat wasn't lockless in
previous releases of Hadoop such as 2.1, 2.3, or 2.5, and there were no
complaints.
was (Author: cmccabe):
bq. Make the heartbeat lockless can avoid the happening of dead DataNode, and I
think it is a necessary
The heartbeat wasn't lockless in Hadoop 2.1, 2.3, or 2.5. It's clearly not
necessary to make the heartbeat lockless. It may be a good optimization, but
it is not a bug that the heartbeat takes locks.
> FsDatasetImpl#createTemporary sometimes holds the FSDatasetImpl lock for a
> very long time
> -----------------------------------------------------------------------------------------
>
> Key: HDFS-7999
> URL: https://issues.apache.org/jira/browse/HDFS-7999
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: zhouyingchao
> Assignee: zhouyingchao
> Attachments: HDFS-7999-001.patch
>
>
> I'm using 2.6.0 and noticed that sometimes a DN's heartbeats were delayed for
> a very long time, say more than 100 seconds. I got the jstack twice, and it
> looks like the threads are all blocked (at getStorageReport) on the dataset
> lock, which is held by a thread calling createTemporary, which in turn is
> blocked waiting for an earlier incarnation of the writer to exit.
> The heartbeat thread stack:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
> - waiting to lock <0x00000007b01428c0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
> - locked <0x00000007b0140ed0> (a java.lang.Object)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
> at java.lang.Thread.run(Thread.java:662)
> The DataXceiver thread holds the dataset lock:
> "DataXceiver for client at XXXXX" daemon prio=10 tid=0x00007f14041e6480
> nid=0x52bc in Object.wait() [0x00007f11d78f7000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1194)
> - locked <0x00000007a33b85d8> (a org.apache.hadoop.util.Daemon)
> at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1231)
> - locked <0x00000007b01428c0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:114)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:179)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
> at java.lang.Thread.run(Thread.java:662)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)