Viraj Jasani created HBASE-28638:
------------------------------------

             Summary: RSProcedureDispatcher to fail-fast for connection closed 
errors
                 Key: HBASE-28638
                 URL: https://issues.apache.org/jira/browse/HBASE-28638
             Project: HBase
          Issue Type: Sub-task
    Affects Versions: 2.5.8
            Reporter: Viraj Jasani
             Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9


In a recent incident, some regions suffered an availability drop of more than 
5 minutes: before the active master could initiate SCP for the dead server, 
some region moves tried to assign regions to the already dead regionserver. 
Sometimes, due to transient issues, the active master is notified of the 
server's death only after a few minutes (5+ minutes in this case).
{code:java}
2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
address=host1:61020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
closed, try=0, retrying... {code}
As we know, retries are unbounded here, so the request kept being retried 
against the dead server.

 

Eventually, SCP could be initiated only after the active master discovered 
that the server was dead:
{code:java}
2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
Processing host1,61020,1713411866443; numProcessing=1

2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
expiration [host1,61020,1713411866443] {code}
leading to
{code:java}
2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
server=host1,61020,1713411866443 for region state=OPENING, 
location=host1,61020,1713411866443, table=T1, 
region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443 
is dead, SCP will interrupt us, give up {code}
This entire outage could have been avoided if RSProcedureDispatcher failed 
fast on connection-closed errors.
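
A minimal sketch of the fail-fast idea, not the actual patch: walk the cause 
chain of the failure and stop retrying as soon as a connection-closed error 
is found. Everything here is hypothetical and for illustration only: 
FailFastSketch, Decision and classify() are made-up names, and the nested 
ConnectionClosedException is a local stand-in for 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException so that the 
example compiles without HBase on the classpath.

```java
import java.io.IOException;

class FailFastSketch {

  /** Local stand-in (assumption) for HBase's ConnectionClosedException. */
  static class ConnectionClosedException extends IOException {
    ConnectionClosedException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical outcome of inspecting a failed remote call. */
  enum Decision { RETRY, FAIL_FAST }

  /**
   * Returns FAIL_FAST if a connection-closed error appears anywhere in the
   * cause chain, otherwise RETRY. The depth bound guards against cyclic
   * cause chains.
   */
  static Decision classify(Throwable error) {
    Throwable t = error;
    for (int depth = 0; t != null && depth < 16; depth++) {
      if (t instanceof ConnectionClosedException) {
        return Decision.FAIL_FAST;
      }
      t = t.getCause();
    }
    return Decision.RETRY;
  }

  public static void main(String[] args) {
    // Mirrors the shape of the logged failure: an outer "failed on local
    // exception" error wrapping the connection-closed cause.
    IOException wrapped = new IOException(
        "Call to address=host1:61020 failed on local exception",
        new ConnectionClosedException("Connection closed"));
    System.out.println(classify(wrapped));                             // FAIL_FAST
    System.out.println(classify(new IOException("call timed out")));  // RETRY
  }
}
```

With a check like this, the dispatcher could give up on the remote call 
immediately instead of retrying until RegionServerTracker reports the server 
as dead. How the failure is then propagated to the procedure is left out of 
this sketch.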



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
