[ 
https://issues.apache.org/jira/browse/HDFS-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391504#comment-14391504
 ] 

Colin Patrick McCabe edited comment on HDFS-7999 at 4/1/15 9:45 PM:
--------------------------------------------------------------------

Thanks for the explanation, [~sinago].  The patch makes sense.

{code}
        // Hang too long, just bail out. This is not supposed to happen.
        if (Time.monotonicNow() - startTime > bailOutDuration) {
          break;
        }
{code}

Can you throw an exception from here rather than breaking?  That seems like it 
would be clearer.  Also, please log a WARN message to explain that there has 
been a problem.  I would prefer to see a log message rather than a comment 
explaining that "this is not supposed to happen".

Instead of naming the timeout period "bailOutDuration", how about something 
like "writerStopTimeoutMs"?  In general, timeouts that are in milliseconds 
should have names that end in "Ms".

{code}
    throw new IOException("Hang " + ((Time.monotonicNow() - startTime) / 1000)
        + " seconds in createTemporary, just bail out");
{code}

This error message seems confusing.  It should be something like "Unable to 
stop existing writer for $REPLICA after $WHATEVER milliseconds."
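
Putting those suggestions together, here is a rough sketch of what I have in 
mind (writerIsStillRunning, the loop shape, and the 60-second default are 
placeholders for whatever createTemporary actually does, not code from the 
patch):

{code}
// Sketch only, not the patch as written.  Assumes the LOG field and
// Time.monotonicNow() that FsDatasetImpl already uses.
private void waitForWriterToStop(ReplicaInPipeline replicaInfo)
    throws IOException {
  final long writerStopTimeoutMs = 60 * 1000;  // milliseconds, so "...Ms"
  final long startTimeMs = Time.monotonicNow();
  while (writerIsStillRunning(replicaInfo)) {  // hypothetical helper
    final long elapsedMs = Time.monotonicNow() - startTimeMs;
    if (elapsedMs > writerStopTimeoutMs) {
      final String msg = "Unable to stop existing writer for " + replicaInfo
          + " after " + elapsedMs + " milliseconds.";
      LOG.warn(msg);  // log the problem instead of only commenting on it
      throw new IOException(msg);
    }
    // ... existing stopWriter / wait logic ...
  }
}
{code}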

I think it looks good aside from that.

bq. [~xinwei] wrote: Make the heartbeat lockless can avoid the happening of 
dead DataNode, and I think it is a necessary

I think it is a good idea to make the heartbeat lockless.  However, it is an 
exaggeration to say that it is necessary.  The heartbeat wasn't lockless in 
previous releases of Hadoop such as 2.1, 2.3, or 2.5, and there were no 
complaints.


was (Author: cmccabe):
bq. Make the heartbeat lockless can avoid the happening of dead DataNode, and I 
think it is a necessary

The heartbeat wasn't lockless in Hadoop 2.1, 2.3, or 2.5.  It's clearly not 
necessary to make the heartbeat lockless.  It may be a good optimization, but 
it is not a bug that the heartbeat takes locks.

> FsDatasetImpl#createTemporary sometimes holds the FSDatasetImpl lock for a 
> very long time
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-7999
>                 URL: https://issues.apache.org/jira/browse/HDFS-7999
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: zhouyingchao
>            Assignee: zhouyingchao
>         Attachments: HDFS-7999-001.patch
>
>
> I'm using 2.6.0 and noticed that sometimes a DN's heartbeats were delayed for 
> a very long time, say more than 100 seconds. I took a jstack twice, and it 
> looks like the blocked threads are all waiting (at getStorageReport) on the 
> dataset lock, which is held by a thread calling createTemporary, which in 
> turn is blocked waiting for an earlier incarnation of the writer to exit 
> (the pattern is sketched after the stack traces below).
> The heartbeat thread stack:
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
>         - waiting to lock <0x00000007b01428c0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
>         - locked <0x00000007b0140ed0> (a java.lang.Object)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
>         at java.lang.Thread.run(Thread.java:662)
> The DataXceiver thread holds the dataset lock:
> "DataXceiver for client at XXXXX" daemon prio=10 tid=0x00007f14041e6480 
> nid=0x52bc in Object.wait() [0x00007f11d78f7000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1194)
> locked <0x00000007a33b85d8> (a org.apache.hadoop.util.Daemon)
> at 
> org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1231)
> locked <0x00000007b01428c0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:114)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:179)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
> at java.lang.Thread.run(Thread.java:662)
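
To make the locking interaction above concrete, here is a minimal hypothetical 
sketch of the pattern (simplified stand-ins, not the actual FsDatasetImpl 
code): a synchronized method joins another thread while holding the object 
monitor, so every other synchronized method on the same object, including the 
heartbeat's getStorageReports path, blocks until the join returns.

{code}
// Hypothetical simplification of the pattern shown in the stack traces above;
// DatasetLike stands in for FsDatasetImpl, not real HDFS code.
class DatasetLike {
  // Like FsDatasetImpl#createTemporary calling ReplicaInPipeline#stopWriter:
  // joins the old writer thread while holding the object monitor.
  synchronized void createTemporaryLike(Thread oldWriter)
      throws InterruptedException {
    oldWriter.join();  // can block for minutes with the lock held
  }

  // Like FsVolumeImpl#getDfsUsed under getStorageReports: the heartbeat
  // thread blocks here until the join() above returns.
  synchronized long getDfsUsedLike() {
    return 0;
  }
}
{code}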


