caroliney14 commented on a change in pull request #2769:
URL: https://github.com/apache/hbase/pull/2769#discussion_r599168657



##########
File path: hbase-server/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
##########
@@ -176,7 +176,7 @@ private void testSanity(final String testName) throws Exception {
   @Test
   public void testRegionAssignmentAfterMasterRecoveryDueToZKExpiry() throws Exception {
     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
-    cluster.startRegionServer();
+    cluster.startRegionServerAndWait(2000);

Review comment:
       `startRegionServer` and `startRegionServerAndWait` do the same thing, 
except that the latter polls the master for the RS's online status for up to 
the passed-in timeout (it returns early if it finds the RS online before the 
timeout). `startRegionServerAndWait` gives the master a bit more time to 
receive the updated `online` status from the region server. That status is set 
after `HRegionServer.handleReportForDutyResponse` but only reported to the 
master a while later, in `HRegionServer.tryRegionServerReport`. That's why the 
master often needs an extra second to learn that the RS is alive. If we 
proceed to other tasks (such as assigning regions) without this wait, the 
master will run into errors because it doesn't yet know the RS is alive.
   
   To be safe, we could make the passed-in timeout longer. It won't change the 
behavior, because the method returns early once it finds the RS online, but 
it will reduce the chance of flakiness. 2000 ms seems to be enough for most 
cases, though.
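
   For illustration, the "poll until online, return early, give up at the 
timeout" behavior described above can be sketched as below. This is not the 
actual `MiniHBaseCluster` code; `OnlinePoller`, `waitUntilOnline`, and the 
`pollIntervalMs` parameter are hypothetical names, and the `BooleanSupplier` 
stands in for whatever check asks the master whether the RS is online.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HBase: a generic poll-with-timeout loop.
final class OnlinePoller {

    private OnlinePoller() {}

    /**
     * Polls isOnline until it returns true or timeoutMs has elapsed.
     * pollIntervalMs is assumed to be positive.
     *
     * @return true if the check succeeded before the deadline, false on
     *         timeout or if the waiting thread was interrupted
     */
    static boolean waitUntilOnline(BooleanSupplier isOnline, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (isOnline.getAsBoolean()) {
                return true; // early return: a longer timeout costs nothing here
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // timed out without seeing the online status
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollIntervalMs, Math.max(1L, remainingMs)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

   With a loop of this shape, raising the timeout only affects the failure 
path: a call such as `OnlinePoller.waitUntilOnline(check, 2000, 100)` returns 
as soon as `check` first reports true, so a larger limit just gives a slow 
status report more room before the wait gives up.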




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

