[
https://issues.apache.org/jira/browse/HDFS-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734468#comment-15734468
]
Vinayakumar B commented on HDFS-9911:
-------------------------------------
I think [~tasanuma0829]'s analysis makes sense. There is a chance that
LifelineSender sends the lifeline before BPServiceActor sends the first
heartbeat, and then postpones the next lifeline.
I think the problem is in {{BPServiceActor#Scheduler}}: the initial value of
{{nextLifelineTime}} is the same as that of {{nextHeartbeatTime}}, namely
{{monotonicNow()}}, so whichever thread starts first sends its message first.
But the first lifeline should wait at least {{lifelineIntervalMs}} or
{{heartbeatIntervalMs}}, so that the heartbeat can go first. Once the first
heartbeat has been sent successfully, subsequent lifeline messages will be
scheduled properly.
I hope the following change in {{BPServiceActor}} will fix this.
{code}
@@ -1063,7 +1068,7 @@ private void sendLifeline() throws IOException {
     volatile long nextHeartbeatTime = monotonicNow();
     @VisibleForTesting
-    volatile long nextLifelineTime = monotonicNow();
+    volatile long nextLifelineTime;
     @VisibleForTesting
     volatile long lastBlockReportTime = monotonicNow();
@@ -1086,6 +1091,7 @@ private void sendLifeline() throws IOException {
       this.heartbeatIntervalMs = heartbeatIntervalMs;
       this.lifelineIntervalMs = lifelineIntervalMs;
       this.blockReportIntervalMs = blockReportIntervalMs;
+      scheduleNextLifeline(monotonicNow());
     }
     // This is useful to make sure NN gets Heartbeat before Blockreport
{code}
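To make the intended effect concrete, here is a minimal standalone sketch of the scheduling behaviour after the change. It is not the actual Hadoop code: the class {{SchedulerSketch}}, the explicit {{now}} arguments (instead of calls to {{monotonicNow()}}), and the {{isHeartbeatDue}}/{{isLifelineDue}} helpers are assumptions made for illustration, and the real {{BPServiceActor#Scheduler}} may differ in detail.

```java
/**
 * Hypothetical, simplified stand-in for BPServiceActor.Scheduler.
 * Times are passed in explicitly so the behaviour is deterministic.
 */
final class SchedulerSketch {
  private final long heartbeatIntervalMs;
  private final long lifelineIntervalMs;

  volatile long nextHeartbeatTime;
  volatile long nextLifelineTime;

  SchedulerSketch(long heartbeatIntervalMs, long lifelineIntervalMs, long now) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.lifelineIntervalMs = lifelineIntervalMs;
    // The first heartbeat is due immediately.
    this.nextHeartbeatTime = now;
    // The first lifeline is deferred by one lifeline interval,
    // so the heartbeat thread gets to send first.
    scheduleNextLifeline(now);
  }

  /** Assumed behaviour: a sent heartbeat also pushes the next lifeline out. */
  long scheduleNextHeartbeat(long now) {
    nextHeartbeatTime = now + heartbeatIntervalMs;
    scheduleNextLifeline(nextHeartbeatTime);
    return nextHeartbeatTime;
  }

  long scheduleNextLifeline(long baseTime) {
    nextLifelineTime = baseTime + lifelineIntervalMs;
    return nextLifelineTime;
  }

  boolean isHeartbeatDue(long now) {
    return nextHeartbeatTime - now <= 0;
  }

  boolean isLifelineDue(long now) {
    return nextLifelineTime - now <= 0;
  }

  public static void main(String[] args) {
    long start = 1000L;
    SchedulerSketch s = new SchedulerSketch(3000L, 9000L, start);
    System.out.println("heartbeat due at startup: " + s.isHeartbeatDue(start));
    System.out.println("lifeline due at startup:  " + s.isLifelineDue(start));
  }
}
```

With this initialisation, a lifeline thread that happens to start before the heartbeat thread finds nothing due and waits, which is the ordering the test expects.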
> TestDataNodeLifeline Fails intermittently
> ------------------------------------------
>
> Key: HDFS-9911
> URL: https://issues.apache.org/jira/browse/HDFS-9911
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.8.0
> Reporter: Anu Engineer
> Assignee: Chris Nauroth
> Fix For: 2.8.0
>
>
> In the HDFS-1312 branch, we have a failure for this test.
> {{org.apache.hadoop.hdfs.server.datanode.TestDataNodeLifeline.testNoLifelineSentIfHeartbeatsOnTime}}
> {noformat}
> Error Message
> Expect metrics to count no lifeline calls. expected:<0> but was:<1>
> Stacktrace
> java.lang.AssertionError: Expect metrics to count no lifeline calls. expected:<0> but was:<1>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.apache.hadoop.hdfs.server.datanode.TestDataNodeLifeline.testNoLifelineSentIfHeartbeatsOnTime(TestDataNodeLifeline.java:256)
> {noformat}
> Details can be found here.
> https://builds.apache.org/job/PreCommit-HDFS-Build/14726/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeLifeline/testNoLifelineSentIfHeartbeatsOnTime/