virajjasani commented on code in PR #5432:
URL: https://github.com/apache/hadoop/pull/5432#discussion_r1121093583


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##########
@@ -1101,15 +1099,12 @@ public void testReportBadBlocks() throws Exception {
 
       block = DFSTestUtil.getFirstBlock(fs, filePath);
       // Test for the overloaded method reportBadBlocks
-      dataNode.reportBadBlocks(block, dataNode.getFSDataset()
-          .getFsVolumeReferences().get(0));
-      Thread.sleep(3000);
-      BlockManagerTestUtil.updateState(cluster.getNamesystem()
-          .getBlockManager());
-      // Verify the bad block has been reported to namenode
-      Assert.assertEquals(1, 
cluster.getNamesystem().getCorruptReplicaBlocks());
-    } finally {
-      cluster.shutdown();
+      dataNode.reportBadBlocks(block, 
dataNode.getFSDataset().getFsVolumeReferences().get(0));
+      GenericTestUtils.waitFor(() -> {
+        
BlockManagerTestUtil.updateState(cluster.getNamesystem().getBlockManager());
+        // Verify the bad block has been reported to namenode
+        return 1 == cluster.getNamesystem().getCorruptReplicaBlocks();
+      }, 100, 10000, "Corrupted replica blocks could not be found");

Review Comment:
   You are correct in the sense that it is not heartbeat trigger as such that 
makes namenode aware of the bad block, but it is the trigger of 
`ReportBadBlockAction#reportTo` that calls `DatanodeProtocol#reportBadBlocks` 
and hence Namenode gets notified. All this happens as part of 
`BPServiceActor#processQueueMessages`, you are right.
   
   However, it takes place immediately after sending heartbeat, as part of the 
same `offerService` call. Hence, the point here is, as long as the main thread 
(test thread) is waiting for enough time to let this whole round of 
`offerService` to complete, we should be good. This is exactly what 
`DataNodeTestUtils#triggerHeartbeat` takes care of, perhaps there might be some 
room for improvement as mentioned above that we could increase extra wait time 
from 100ms to maybe a bit higher, but I am sure this utility is used by 
multiple tests and the wait time would have been sufficient so far (as no one 
has attempted to change it so far).
   
   But as to what you mentioned, it's correct. The main purpose however is for 
us to let `offerService` complete it's whole run and we wait by then as part of 
`DataNodeTestUtils#triggerHeartbeat`, hence I am not able to reproduce flaky 
behavior with this change, at least locally. Does this sound good to you 
@tomscut?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to