[ 
https://issues.apache.org/jira/browse/HDFS-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734468#comment-15734468
 ] 

Vinayakumar B commented on HDFS-9911:
-------------------------------------

I think the analysis by [~tasanuma0829] makes sense. There is a chance that 
LifeLineSender sends a lifeline before BPServiceActor sends the heartbeat that 
would postpone the next lifeline.
I think the problem is in {{BPServiceActor#Scheduler}}: the initial value of 
{{nextLifelineTime}} is the same as that of {{nextHeartbeatTime}}, namely 
{{monotonicNow()}}, so whichever thread starts first will send its message. 
The first lifeline should wait at least {{lifelineIntervalMs}} or 
{{heartbeatIntervalMs}} so that the heartbeat can go first. Once the first 
heartbeat has been sent successfully, subsequent lifeline messages will be 
scheduled properly.

So I hope the following change in {{BPServiceActor}} will fix it.
{code}
@@ -1063,7 +1068,7 @@ private void sendLifeline() throws IOException {
     volatile long nextHeartbeatTime = monotonicNow();
 
     @VisibleForTesting
-    volatile long nextLifelineTime = monotonicNow();
+    volatile long nextLifelineTime;
 
     @VisibleForTesting
     volatile long lastBlockReportTime = monotonicNow();
@@ -1086,6 +1091,7 @@ private void sendLifeline() throws IOException {
       this.heartbeatIntervalMs = heartbeatIntervalMs;
       this.lifelineIntervalMs = lifelineIntervalMs;
       this.blockReportIntervalMs = blockReportIntervalMs;
+      scheduleNextLifeline(monotonicNow());
     }
 
     // This is useful to make sure NN gets Heartbeat before Blockreport
{code}
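
To make the intent of the patch concrete, here is a minimal, self-contained sketch of the scheduling idea. It is not the real {{BPServiceActor.Scheduler}}: the class name {{LifelineSchedulerSketch}}, the explicit {{nowMs}} parameter (used in place of {{monotonicNow()}}), and the {{heartbeatSent}} helper are assumptions made for illustration only. It shows that when the first lifeline is scheduled one {{lifelineIntervalMs}} after construction, only the heartbeat is due at start-up.

```java
/**
 * Simplified, hypothetical model of the scheduling fields discussed above.
 * Not the real BPServiceActor.Scheduler; the clock value is passed in
 * explicitly so that the behavior is deterministic.
 */
final class LifelineSchedulerSketch {
  private final long heartbeatIntervalMs;
  private final long lifelineIntervalMs;
  private volatile long nextHeartbeatTime;
  private volatile long nextLifelineTime;

  LifelineSchedulerSketch(long nowMs, long heartbeatIntervalMs,
      long lifelineIntervalMs) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.lifelineIntervalMs = lifelineIntervalMs;
    // The first heartbeat is due immediately.
    this.nextHeartbeatTime = nowMs;
    // Proposed change: the first lifeline waits one lifeline interval
    // instead of also being due at nowMs.
    scheduleNextLifeline(nowMs);
  }

  /** Schedules the next lifeline one interval after baseTimeMs. */
  long scheduleNextLifeline(long baseTimeMs) {
    nextLifelineTime = baseTimeMs + lifelineIntervalMs;
    return nextLifelineTime;
  }

  /**
   * Assumed helper: after a successful heartbeat, schedule the next
   * heartbeat and push the lifeline out relative to it.
   */
  void heartbeatSent(long nowMs) {
    nextHeartbeatTime = nowMs + heartbeatIntervalMs;
    scheduleNextLifeline(nextHeartbeatTime);
  }

  boolean isHeartbeatDue(long nowMs) {
    return nextHeartbeatTime - nowMs <= 0;
  }

  boolean isLifelineDue(long nowMs) {
    return nextLifelineTime - nowMs <= 0;
  }

  public static void main(String[] args) {
    long start = 1000L;
    LifelineSchedulerSketch s =
        new LifelineSchedulerSketch(start, 3000L, 9000L);
    System.out.println("heartbeat due at start: " + s.isHeartbeatDue(start));
    System.out.println("lifeline due at start:  " + s.isLifelineDue(start));
  }
}
```

With the original initialization ({{nextLifelineTime = monotonicNow()}}), both checks would be true at start-up, which is the race described above.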


> TestDataNodeLifeline  Fails intermittently
> ------------------------------------------
>
>                 Key: HDFS-9911
>                 URL: https://issues.apache.org/jira/browse/HDFS-9911
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Anu Engineer
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>
> In HDFS-1312 branch, we have a failure for this test.
> {{org.apache.hadoop.hdfs.server.datanode.TestDataNodeLifeline.testNoLifelineSentIfHeartbeatsOnTime}}
> {noformat}
> Error Message
> Expect metrics to count no lifeline calls. expected:<0> but was:<1>
> Stacktrace
> java.lang.AssertionError: Expect metrics to count no lifeline calls. expected:<0> but was:<1>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.apache.hadoop.hdfs.server.datanode.TestDataNodeLifeline.testNoLifelineSentIfHeartbeatsOnTime(TestDataNodeLifeline.java:256)
> {noformat}
> Details can be found here.
> https://builds.apache.org/job/PreCommit-HDFS-Build/14726/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeLifeline/testNoLifelineSentIfHeartbeatsOnTime/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
