[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243744#comment-17243744
 ] 

Anoop Sam John commented on HBASE-25032:
----------------------------------------

{quote}
I spent some time looking at the code today. One thing I noticed is that we 
abort the RS by throwing exception in case of any issues with replication setup 
with the peer during the startup of RS : 
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java#L241

So looks like the current design already treats some aspects of setting up the 
replication as important and aborts the RS if not setup properly as opposed to 
our thought of letting RS accept requests even if replication fails in an async 
thread
{quote}
If replication is enabled and we cannot set it up in an RS instance, aborting 
that RS looks correct. Otherwise the data in this RS will never get replicated.  
Once the RS aborts, its WAL replication queue will get assigned to another 
healthy RS. So it should be fine to try this replication setup in an async 
thread. Yes, it may take some time, and until then all writes will sit in the 
backlog and get replicated later. If the replication setup still fails after 
the attempt (a rare chance anyway, right?), it will abort the RS then.
bq.Once Master acknowledges the reportForDuty and sends back the response, RS 
does a bunch of stuff like initializing replication sources etc before becoming 
online. 
What else is there other than this replication setup?  Can you list all of it?  
We can then look for anything else that might be time consuming.  That would 
really help to decide whether we need yet another step in which the RS informs 
HM that it is ready to take on region load.
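
For reference, the extra step under discussion could look roughly like the 
sketch below on the Master side. It is purely illustrative: 
ServerReadinessSketch and reportOnline are hypothetical names, not existing 
HMaster APIs. It assumes the RS makes a second call once its initialization 
is done.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the HMaster implementation.
final class ServerReadinessSketch {
  // Servers that reported for duty but may still be initializing.
  private final Set<String> reported = ConcurrentHashMap.newKeySet();
  // Servers that confirmed they are ready to take regions.
  private final Set<String> online = ConcurrentHashMap.newKeySet();

  /** Called when the RS reports for duty; it is not assignable yet. */
  void reportForDuty(String serverName) {
    reported.add(serverName);
  }

  /**
   * Hypothetical second call: the RS signals that initialization finished.
   * Returns false if the server never reported for duty.
   */
  boolean reportOnline(String serverName) {
    if (!reported.remove(serverName)) {
      return false;
    }
    online.add(serverName);
    return true;
  }

  /** Regions are assigned only to servers that confirmed readiness. */
  boolean canAssignTo(String serverName) {
    return online.contains(serverName);
  }
}
```

Whether this is worth it depends on how much slow work remains after 
reportForDuty once the replication setup is made async.
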


> Wait for region server to become online before adding it to online servers in 
> Master
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-25032
>                 URL: https://issues.apache.org/jira/browse/HBASE-25032
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Caroline
>            Priority: Major
>
> As part of RS start up, RS reports for duty to Master . Master acknowledges 
> the request and adds it to the onlineServers list for further assigning any 
> regions to the RS
> Once Master acknowledges the reportForDuty and sends back the response, RS 
> does a bunch of stuff like initializing replication sources etc before 
> becoming online. However, sometimes there could be an issue with initializing 
> replication sources when it is unable to connect to peer clusters because of 
> some kerberos configuration and there would be a delay of around 20 mins in 
> becoming online.
>  
> Since master considers it online, it tries to assign regions and which fails 
> with ServerNotRunningYet exception, then the master tries to unassign which 
> again fails with the same exception leading the region to FAILED_CLOSE state.
>  
> It would be good to have a check to see if the RS is ready to accept the 
> assignment requests before adding it to online servers list which would 
> account for any such delays as described above



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
