[
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852608#comment-17852608
]
Duo Zhang commented on HBASE-28638:
-----------------------------------
Agree with [~apurtell] that force scheduling an SCP here is the correct way to
recover the cluster while also keeping things correct.
I also think maybe we could add a new feature to the hbase Canary: once the
canary detects that a regionserver is in trouble, i.e. not responding or
extremely slow, it could try to kill/restart that region server.
We could make this module pluggable, so you are free to implement it based on
your deployment type. For example, a general approach might be to just
force-delete the znode on zookeeper; for a typical machine/ECS deployment, you
could ssh to the machine and use kill -9; and on a K8s deployment, you could
use the API or kubectl to restart the pod.
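A minimal sketch of what such a pluggable killer might look like (all names
here are hypothetical, not an existing HBase API, and the ssh/pkill command is
only an illustration for the plain-machine case):

{code:java}
// Hypothetical plugin interface for a Canary "kill stuck regionserver" action.
interface RegionServerKiller {
  // serverName is in the usual host,port,startcode form,
  // e.g. host1,61020,1713411866443
  boolean kill(String serverName) throws Exception;
}

// Illustrative implementation for a plain machine/ECS deployment: ssh + kill -9.
class SshKiller implements RegionServerKiller {
  // Extract the host part from a host,port,startcode server name.
  static String parseHost(String serverName) {
    return serverName.split(",")[0];
  }

  @Override
  public boolean kill(String serverName) throws Exception {
    Process p = new ProcessBuilder(
        "ssh", parseHost(serverName), "pkill", "-9", "-f", "HRegionServer")
        .inheritIO().start();
    return p.waitFor() == 0;
  }
}
{code}

A K8s implementation would instead call the Kubernetes API to delete the pod,
and a zookeeper-based one would force-delete the regionserver's ephemeral
znode.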
WDYT?
Thanks.
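On the fail-fast side, the core of this issue could be sketched roughly like
the following; the exception class below is a local stand-in for
org.apache.hadoop.hbase.exceptions.ConnectionClosedException, and
shouldFailFast is a hypothetical helper, not the actual RSProcedureDispatcher
code:

{code:java}
// Stand-in for org.apache.hadoop.hbase.exceptions.ConnectionClosedException.
class ConnectionClosedException extends java.io.IOException {
  ConnectionClosedException(String msg) { super(msg); }
}

final class FailFast {
  // Walk the cause chain: a closed connection means the peer is gone, so
  // retrying the same dead regionserver forever only delays SCP. Give up
  // immediately and let the master expire the server and reassign regions.
  static boolean shouldFailFast(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof ConnectionClosedException) {
        return true;
      }
    }
    return false; // other errors keep the existing retry behavior
  }
}
{code}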
> RSProcedureDispatcher to fail-fast for connection closed errors
> ---------------------------------------------------------------
>
> Key: HBASE-28638
> URL: https://issues.apache.org/jira/browse/HBASE-28638
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.5.8
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As per one of the recent incidents, some regions faced a 5+ minute
> availability drop because, before the active master could initiate SCP for
> the dead server, some region moves tried to assign regions on the already
> dead regionserver. Sometimes, due to transient issues, the active master
> gets notified only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790]
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to
> address=host1:61020 failed on local exception:
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection
> closed, try=0, retrying... {code}
> And as we know, we have infinite retries here, so it kept retrying
> indefinitely.
>
> Eventually, SCP could be initiated only after the active master discovered
> the server as dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer -
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO [RegionServerTracker-0]
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833]
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691,
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51,
> server=host1,61020,1713411866443 for region state=OPENING,
> location=host1,61020,1713411866443, table=T1,
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire outage duration could be avoided if we fail fast on connection
> drop errors.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)