[jira] [Commented] (HDFS-9435) TestBlockRecovery#testRBWReplicas is failing intermittently

Rakesh R (JIRA) Tue, 17 Nov 2015 08:17:53 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008932#comment-15008932
 ]


Rakesh R commented on HDFS-9435:
--------------------------------

It looks like there is a race between the wait loop in 
{{triggerHeartbeatForTests()}} (shown below) and the 
BPServiceActor#scheduleNextHeartbeat() call made by BPServiceActor#offerService().
{code}
  void triggerHeartbeatForTests() {
    synchronized (pendingIncrementalBRperStorage) {
      final long nextHeartbeatTime = scheduler.scheduleHeartbeat();
      pendingIncrementalBRperStorage.notifyAll();
      while (nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0) {
        try {
          pendingIncrementalBRperStorage.wait(100);
        } catch (InterruptedException e) {
          return;
        }
      }
    }
  }
{code}
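
The exit condition of the loop above can be replayed on a single thread to see 
why it says nothing about the active NN. This is only a sketch: the 
{{Scheduler}} class here is a hypothetical stand-in, not the Hadoop class, and 
it assumes that scheduleNextHeartbeat() moves nextHeartbeatTime forward by the 
heartbeat interval.

```java
class HeartbeatRaceSketch {

  /** Hypothetical stand-in for the heartbeat scheduler (not the Hadoop class). */
  static class Scheduler {
    long nextHeartbeatTime;
    final long intervalMs;

    Scheduler(long intervalMs) {
      this.intervalMs = intervalMs;
    }

    /** Schedules a heartbeat for "now" and returns that time. */
    long scheduleHeartbeat(long now) {
      nextHeartbeatTime = now;
      return nextHeartbeatTime;
    }

    /** Assumption: the next heartbeat is pushed out by one interval. */
    long scheduleNextHeartbeat() {
      nextHeartbeatTime += intervalMs;
      return nextHeartbeatTime;
    }
  }

  /** Same predicate as the while loop in triggerHeartbeatForTests(). */
  static boolean stillWaiting(long captured, Scheduler scheduler) {
    return captured - scheduler.nextHeartbeatTime >= 0;
  }

  /** Returns true if the wait ends while the active NN is still unset. */
  static boolean waitEndsBeforeActiveNnIsSet() {
    Scheduler scheduler = new Scheduler(3000L);
    Object activeNN = null; // would be set later, at the end of offerService()

    // triggerHeartbeatForTests() captures the scheduled time and starts waiting.
    long captured = scheduler.scheduleHeartbeat(1000L);
    boolean waitingBefore = stillWaiting(captured, scheduler);

    // The actor thread sends the heartbeat and reschedules the next one.
    scheduler.scheduleNextHeartbeat();
    boolean waitingAfter = stillWaiting(captured, scheduler);

    // The loop exits here, although nothing has assigned activeNN yet.
    return waitingBefore && !waitingAfter && activeNN == null;
  }

  public static void main(String[] args) {
    System.out.println("wait ended with activeNN unset: "
        + waitEndsBeforeActiveNnIsSet());
  }
}
```

The point of the replay is that the loop only observes the scheduler's 
timestamp, so it can return as soon as the heartbeat is rescheduled, before 
the rest of the heartbeat handling has finished.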

Execution sequence that results in the test case failure:

1=> During startup, it calls 
{{dn.getAllBpOs().get(0).triggerHeartbeatForTests()}}, which initializes {{final 
long nextHeartbeatTime = scheduler.scheduleHeartbeat();}}
2=> BPServiceActor#offerService()
3=> BPServiceActor#sendHeartBeat()
4=> BPServiceActor.scheduler.scheduleNextHeartbeat()
5=> Now the loop condition {{nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0}} 
immediately stops holding, so #triggerHeartbeatForTests() ends its waiting 
period and the unit test starts.
6=> During the test, it calls 
{{BlockRecoveryWorker#getActiveNamenodeForBP()}}, sees a null ActiveNN, and 
throws an exception, because BPServiceActor#offerService() is still in 
progress and has not yet updated the ActiveNN.
{code}
    DatanodeProtocolClientSideTranslatorPB activeNN = bpos.getActiveNN();
    if (activeNN == null) {
      throw new IOException(
          "Block pool " + bpid + " has not recognized an active NN");
    }
{code}
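
One way a test could guard against this is to poll until the active NN is 
visible before exercising block recovery. The helper below is a minimal sketch 
of that idea, not the fix applied to this issue: {{waitForNonNull}} and the 
{{AtomicReference}} standing in for the ActiveNN field are hypothetical names 
introduced for illustration.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class ActiveNnWaitSketch {

  /**
   * Polls the supplier until it returns a non-null value or the timeout
   * elapses. Hypothetical helper, not part of Hadoop.
   */
  static <T> T waitForNonNull(Supplier<T> source, long timeoutMs, long pollMs)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      T value = source.get();
      if (value != null) {
        return value;
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("value still null after " + timeoutMs + " ms");
      }
      Thread.sleep(pollMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the ActiveNN reference that offerService() updates late.
    AtomicReference<String> activeNN = new AtomicReference<>();

    Thread actor = new Thread(() -> {
      try {
        Thread.sleep(50); // heartbeat handling still in progress
      } catch (InterruptedException e) {
        return;
      }
      activeNN.set("nn1");
    });
    actor.start();

    // The test proceeds only once the active NN has been recorded.
    String nn = waitForNonNull(activeNN::get, 5000, 10);
    System.out.println("active NN: " + nn);
    actor.join();
  }
}
```

In a real test the supplier would wrap whatever accessor exposes the active NN 
for the block pool. The timeout keeps a genuine failure from hanging the test.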


> TestBlockRecovery#testRBWReplicas is failing intermittently
> -----------------------------------------------------------
>
>                 Key: HDFS-9435
>                 URL: https://issues.apache.org/jira/browse/HDFS-9435
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: testRBWReplicas.log
>
>
> TestBlockRecovery#testRBWReplicas is failing in the [build 
> 13536|https://builds.apache.org/job/PreCommit-HDFS-Build/13536/testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockRecovery/testRBWReplicas/].
>  It looks like a bug in the test due to a race condition.
> Note: Logs taken from the build are attached to this jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
