[ 
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852324#comment-17852324
 ] 

Duo Zhang commented on HBASE-28638:
-----------------------------------

HBCKSCP is what we need to do when there is code bug in HBase or data 
corruption which causes master fail to schedule SCP for some dead servers. It 
is not in the normal path so it must be scheduled manually, by operators, as 
you said.

I still think the root problem here is why we can only schedule SCP after 5 
minutes... What is the problem here? What prevents the active master schedules 
SCP?

> RSProcedureDispatcher to fail-fast for connection closed errors
> ---------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.5.8
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As per one of the recent incidents, some regions faced 5+ minute of 
> availability drop because before active master could initiate SCP for the 
> dead server, some region moves tried to assign regions on the already dead 
> regionserver. Sometimes, due to transient issues, we see that active master 
> gets notified after few minutes (5+ minute in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
> address=host1:61020 failed on local exception: 
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
> closed, try=0, retrying... {code}
> And as we know, we have infinite retries here, so it kept going on..
>  
> Eventually, SCP could be initiated only after active master discovered the 
> server as dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
> server=host1,61020,1713411866443 for region state=OPENING, 
> location=host1,61020,1713411866443, table=T1, 
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer 
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire duration of outage could be avoided if we can fail-fast for 
> connection drop errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to