caroliney14 commented on pull request #2769: URL: https://github.com/apache/hbase/pull/2769#issuecomment-745705944
@Apache9 At the time of `reportForDuty`, there is no way for the Master to know whether the RS is ready to take regions. After the Master acknowledges the `reportForDuty` and sends the response back to the RS, the RS performs all the actions within [this handleReportForDuty() method](https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1494), including creating an ephemeral znode, initializing the file system, setting up WAL and replication, setting up metrics, starting service threads, starting the heap memory manager, and setting the RS internal `online` boolean to true, among other tasks. So the RS does a lot of vital and time-consuming setup *after* reporting for duty to the Master.

The current flow is as follows: RS starts up -> RS sends `reportForDuty` to Master -> Master adds RS to Master's onlineServers list, and sends response to RS -> RS receives `reportForDuty` response from Master -> RS finishes setup, including setting up WAL and replication.

The issue we observed, which led to the creation of this JIRA, is a ~20min delay in the RS initializing replication sources when the RS is unable to connect to peer clusters because of some Kerberos configuration. Since the Master already considers the RS online, it tries to assign regions to it, which fails with `ServerNotRunningYetException`.

There are a few ways to address this issue, described [here](https://issues.apache.org/jira/browse/HBASE-25032?focusedCommentId=17241847&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17241847). The approach taken in this PR alters the handling of `reportForDuty` on the Master side.

The new flow is as follows: RS starts up -> RS sends `reportForDuty` to Master -> Master acknowledges `reportForDuty`, sends response to RS; at the same time, Master spawns thread to poll for RS 'online' flag (i.e.
RS setup complete) -> RS receives 'reportForDuty received' acknowledgement from Master -> RS finishes setup, sets its 'online' flag to true -> Master sees RS has finished setup -> Master adds RS to Master's onlineServers list.
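
To make the new flow concrete, here is a rough, self-contained sketch of the Master-side behavior. It is only an illustration under stated assumptions and is not the code in this PR: `MasterSketch`, `onReportForDuty`, `isAssignable`, `waitUntilAssignable`, the `BooleanSupplier` probe standing in for "ask the RS whether its `online` flag is set", and the poll interval are all hypothetical names chosen for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the Master side of the new flow; not HBase code.
class MasterSketch {
    // Servers the Master considers safe targets for region assignment.
    private final Set<String> onlineServers = ConcurrentHashMap.newKeySet();

    // Called when an RS reports for duty. Returns right away (the "ack") and
    // registers the server only after the probe says the RS finished setup.
    void onReportForDuty(String serverName, BooleanSupplier rsOnlineProbe, long pollMillis) {
        Thread poller = new Thread(() -> {
            try {
                while (!rsOnlineProbe.getAsBoolean()) {
                    TimeUnit.MILLISECONDS.sleep(pollMillis);
                }
                onlineServers.add(serverName);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop polling if interrupted
            }
        }, "rs-online-poller-" + serverName);
        poller.setDaemon(true); // do not keep the JVM alive for a stalled RS
        poller.start();
    }

    // True only once the RS has been added to onlineServers.
    boolean isAssignable(String serverName) {
        return onlineServers.contains(serverName);
    }

    // Convenience for demos: wait up to timeoutMillis for registration.
    boolean waitUntilAssignable(String serverName, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!isAssignable(serverName) && System.nanoTime() < deadline) {
            try {
                TimeUnit.MILLISECONDS.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return isAssignable(serverName);
    }
}
```

In this sketch, an RS whose setup stalls (for example while initializing replication sources) never enters `onlineServers`, so the Master does not attempt to assign regions to it. A real implementation would presumably also need a timeout and handling for an RS that dies before coming online, which the sketch leaves out.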