[
https://issues.apache.org/jira/browse/HDFS-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008932#comment-15008932
]
Rakesh R commented on HDFS-9435:
--------------------------------
It looks like there is a race between the waiting period in
triggerHeartbeatForTests() and the scheduler's scheduleNextHeartbeat() call made
from BPServiceActor#offerService().
{code}
void triggerHeartbeatForTests() {
  synchronized (pendingIncrementalBRperStorage) {
    final long nextHeartbeatTime = scheduler.scheduleHeartbeat();
    pendingIncrementalBRperStorage.notifyAll();
    while (nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0) {
      try {
        pendingIncrementalBRperStorage.wait(100);
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}
{code}
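The exit condition of the loop above can be illustrated with a small, single-threaded model. This is only a sketch: the {{Scheduler}} class below is a hypothetical stand-in, and it assumes that scheduleHeartbeat() sets the next heartbeat time to "now" while scheduleNextHeartbeat() advances it by one interval. The {{activeNN}} variable is likewise a placeholder for the state that offerService() updates later.

```java
public class HeartbeatRaceModel {

    /** Hypothetical, simplified stand-in for the scheduler (not the Hadoop class). */
    static class Scheduler {
        volatile long nextHeartbeatTime;
        final long intervalMs;

        Scheduler(long now, long intervalMs) {
            this.nextHeartbeatTime = now + intervalMs;
            this.intervalMs = intervalMs;
        }

        /** Assumed behaviour: request an immediate heartbeat. */
        long scheduleHeartbeat(long now) {
            nextHeartbeatTime = now;
            return nextHeartbeatTime;
        }

        /** Assumed behaviour: push the next heartbeat one interval ahead. */
        long scheduleNextHeartbeat() {
            nextHeartbeatTime += intervalMs;
            return nextHeartbeatTime;
        }
    }

    /** The loop condition used by triggerHeartbeatForTests(). */
    static boolean stillWaiting(long captured, Scheduler scheduler) {
        return captured - scheduler.nextHeartbeatTime >= 0;
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler(1_000L, 3_000L);
        Object activeNN = null; // placeholder: set only after the heartbeat response is handled

        // Test thread: capture the time and start waiting.
        long captured = scheduler.scheduleHeartbeat(5_000L);
        System.out.println("waiting: " + stillWaiting(captured, scheduler));

        // Actor thread: the heartbeat is rescheduled before the response is processed.
        scheduler.scheduleNextHeartbeat();
        System.out.println("waiting: " + stillWaiting(captured, scheduler));
        System.out.println("activeNN known: " + (activeNN != null));
    }
}
```

In this model the wait ends as soon as the next heartbeat is scheduled, while the placeholder for the active NN is still null. That matches the window described in the sequence below.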
Execution sequence that results in the test case failure:
1=> During startup, the test calls
{{dn.getAllBpOs().get(0).triggerHeartbeatForTests()}}, which initializes {{final
long nextHeartbeatTime = scheduler.scheduleHeartbeat();}}
2=> BPServiceActor#offerService()
3=> BPServiceActor#sendHeartBeat()
4=> BPServiceActor.scheduler.scheduleNextHeartbeat()
5=> Now the loop condition {{nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0}}
immediately stops holding, so #triggerHeartbeatForTests() ends its waiting period
and the unit test proceeds.
6=> During the test, {{BlockRecoveryWorker#getActiveNamenodeForBP()}} is called,
sees a null ActiveNN, and throws an exception, because
BPServiceActor#offerService() is still in progress and has not yet updated the
ActiveNN.
{code}
DatanodeProtocolClientSideTranslatorPB activeNN = bpos.getActiveNN();
if (activeNN == null) {
  throw new IOException(
      "Block pool " + bpid + " has not recognized an active NN");
}
{code}
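One possible way to close this window in a test is to wait on the state the test actually depends on (a non-null active NN) rather than on the heartbeat schedule. The following is only a sketch under that assumption, not the fix applied for this issue. The {{waitFor}} helper and the {{AtomicReference}} standing in for the active NN are hypothetical names introduced for illustration.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

public class WaitForActiveNN {

    /** Polls the condition until it holds, or throws once the timeout elapses. */
    static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the value returned by bpos.getActiveNN().
        AtomicReference<String> activeNN = new AtomicReference<>();

        // Simulates the actor thread finishing its heartbeat handling a bit later.
        Thread actor = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            activeNN.set("nn1");
        });
        actor.start();

        // The test proceeds only once the active NN is actually known.
        waitFor(() -> activeNN.get() != null, 10, 5_000);
        System.out.println("active NN: " + activeNN.get());
        actor.join();
    }
}
```

Waiting on the observable state keeps the test independent of the order in which the actor thread reschedules the heartbeat and processes the response.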
> TestBlockRecovery#testRBWReplicas is failing intermittently
> -----------------------------------------------------------
>
> Key: HDFS-9435
> URL: https://issues.apache.org/jira/browse/HDFS-9435
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: testRBWReplicas.log
>
>
> TestBlockRecovery#testRBWReplicas is failing in the [build
> 13536|https://builds.apache.org/job/PreCommit-HDFS-Build/13536/testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockRecovery/testRBWReplicas/].
> It looks like a bug in the tests due to a race condition.
> Note: Logs taken from the build are attached to this jira.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)