[
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250035#comment-17250035
]
Caroline commented on HBASE-25032:
----------------------------------
[~anoop.hbase] Once Master acknowledges the reportForDuty and sends the
response back to the RS, RS performs all the actions within [this
handleReportForDuty()
method|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1494].
Aside from setting up the WAL and replication, this method also sets up an
ephemeral znode, initializes the file system, sets up metrics, starts service
threads, starts the heap memory manager, and sets the RS internal 'online'
boolean to true, among other tasks. So, it seems that the RS does a lot of vital
and time-consuming setup after reporting for duty to Master.
Therefore, I believe it makes sense for Master to delay placing the RS into its
onlineServers list until after the RS has completed all of the above tasks. The
approach taken in the PRs is to leave the RS reportForDuty/handleReportForDuty
logic as is, and change the Master-side logic so that Master asynchronously
polls for the RS's internal 'online' boolean to be set to true before placing
the RS into its onlineServers list (the RS sets this flag at the end of its
handleReportForDuty method).
The flow looks something like this:
RS starts up -> RS sends reportForDuty to Master -> Master acknowledges
reportForDuty, sends response to RS; at the same time, Master spawns thread to
poll for RS 'online' flag (i.e. RS setup complete) -> RS receives
'reportForDuty received' acknowledgement from Master -> RS finishes setup, sets
its 'online' flag to true -> Master sees RS has finished setup -> Master adds
RS to Master's onlineServers list.
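To make the flow above concrete, here is a minimal, self-contained sketch of the Master-side polling idea. It is not the code from the PRs: FakeRegionServer, FakeMaster, checkAndAdmit and the poll interval are all hypothetical names made up for illustration, and the 'online' check is an in-process call here, whereas the real Master would have to ask the RS remotely. It also omits any timeout or give-up handling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch only; none of these types are HBase classes.
class ReportForDutySketch {

  /** Stand-in for a region server whose 'online' flag flips when setup ends. */
  static class FakeRegionServer {
    final String name;
    private final AtomicBoolean online = new AtomicBoolean(false);

    FakeRegionServer(String name) { this.name = name; }

    boolean isOnline() { return online.get(); }

    /** Simulates reaching the end of handleReportForDuty(). */
    void finishSetup() { online.set(true); }
  }

  /** Stand-in for the Master-side bookkeeping of online servers. */
  static class FakeMaster implements AutoCloseable {
    private final Set<String> onlineServers = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService poller =
        Executors.newSingleThreadScheduledExecutor();

    /** Acknowledges the report right away, but defers the onlineServers update. */
    void reportForDuty(FakeRegionServer rs, long pollMillis) {
      poller.schedule(() -> poll(rs, pollMillis), pollMillis, TimeUnit.MILLISECONDS);
    }

    /** One polling round: admit the RS if it is online, otherwise try again later. */
    private void poll(FakeRegionServer rs, long pollMillis) {
      if (!checkAndAdmit(rs)) {
        poller.schedule(() -> poll(rs, pollMillis), pollMillis, TimeUnit.MILLISECONDS);
      }
    }

    /** Adds the RS to onlineServers only if its 'online' flag is already true. */
    boolean checkAndAdmit(FakeRegionServer rs) {
      if (rs.isOnline()) {
        onlineServers.add(rs.name);
        return true;
      }
      return false;
    }

    boolean isOnline(String name) { return onlineServers.contains(name); }

    @Override
    public void close() { poller.shutdownNow(); }
  }

  public static void main(String[] args) throws InterruptedException {
    FakeRegionServer rs = new FakeRegionServer("rs1");
    try (FakeMaster master = new FakeMaster()) {
      master.reportForDuty(rs, 10);
      System.out.println("after ack, in onlineServers: " + master.isOnline("rs1"));
      rs.finishSetup();
      while (!master.isOnline("rs1")) {
        Thread.sleep(5); // wait for the poller to notice the flag
      }
      System.out.println("after setup, in onlineServers: " + master.isOnline("rs1"));
    }
  }
}
```

The point of the sketch is that the acknowledgement and the onlineServers update are decoupled: the RS gets its response immediately, while region assignment cannot target it until a later poll sees the 'online' flag.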
> Wait for region server to become online before adding it to online servers in
> Master
> ------------------------------------------------------------------------------------
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
> Issue Type: Bug
> Reporter: Sandeep Guggilam
> Assignee: Caroline
> Priority: Major
>
> As part of RS startup, the RS reports for duty to Master. Master acknowledges
> the request and adds the RS to the onlineServers list so that regions can be
> assigned to it.
> Once Master acknowledges the reportForDuty and sends back the response, the RS
> performs further setup, such as initializing replication sources, before
> becoming online. However, initializing replication sources can sometimes stall
> when the RS is unable to connect to peer clusters because of a Kerberos
> configuration problem, and the RS then takes around 20 minutes to become
> online.
>
> Since Master already considers the RS online, it tries to assign regions to
> it, which fails with ServerNotRunningYetException. Master then tries to
> unassign the region, which fails with the same exception and leaves the region
> in FAILED_CLOSE state.
>
> It would be good to check whether the RS is ready to accept assignment
> requests before adding it to the onlineServers list, which would account for
> delays such as the one described above.