caroliney14 commented on a change in pull request #2769:
URL: https://github.com/apache/hbase/pull/2769#discussion_r599168657
##########
File path: hbase-server/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
##########
@@ -176,7 +176,7 @@ private void testSanity(final String testName) throws Exception {
@Test
public void testRegionAssignmentAfterMasterRecoveryDueToZKExpiry() throws Exception {
MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
- cluster.startRegionServer();
+ cluster.startRegionServerAndWait(2000);
Review comment:
`startRegionServer` and `startRegionServerAndWait` do the same thing, except
that the latter polls the master for the region server's online status for up
to the given timeout, returning early if the region server shows up as online
before the timeout expires. `startRegionServerAndWait` gives the master a bit
more time to receive the updated `online` status from the region server. That
status is set after `HRegionServer.handleReportForDutyResponse` but only
reported to the master a while later, in `HRegionServer.tryRegionServerReport`.
That's why the master often needs an extra second to learn that the region
server is alive. If we proceed to other tasks (such as assigning regions)
without this wait, the master runs into errors because it doesn't yet know the
region server is alive.
To be safe, we could pass a longer timeout. It won't change the behavior,
because the method returns early once it sees the region server online, but it
would reduce the chance of flakiness. 2000 ms seems to be enough for most
cases, though.
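For reference, the poll-with-early-return pattern described above could be sketched roughly as follows. This is not the actual `MiniHBaseCluster` code: `PollUntilOnline`, `waitUntil`, the `isOnline` supplier, and the 50 ms poll interval are all made up for illustration, and the supplier only stands in for "the master reports the region server as online".

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only; names and intervals are hypothetical, not HBase APIs.
final class PollUntilOnline {

  /**
   * Polls isOnline until it returns true or timeoutMs elapses.
   * Returns true as soon as the condition holds, false on timeout or interrupt.
   */
  static boolean waitUntil(BooleanSupplier isOnline, long timeoutMs, long pollIntervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (isOnline.getAsBoolean()) {
        return true; // early return: no need to wait out the full timeout
      }
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // timed out without seeing the condition
      }
      try {
        // Sleep for one poll interval, but never past the deadline.
        Thread.sleep(Math.min(pollIntervalMs, Math.max(1L, remainingNanos / 1_000_000L)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    // Stand-in for the delayed report to the master: becomes true after ~300 ms.
    BooleanSupplier reported = () -> System.nanoTime() - start >= 300_000_000L;
    boolean online = waitUntil(reported, 2000, 50);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    System.out.println("online=" + online + " after ~" + elapsedMs + " ms");
  }
}
```

With a shape like this, raising the timeout from 2000 ms to something larger only affects the failure path: a run in which the status arrives quickly still returns on the first poll that sees it.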