[jira] [Comment Edited] (HBASE-28638) RSProcedureDispatcher to fail-fast for connection closed errors

Viraj Jasani (Jira) Thu, 06 Jun 2024 10:43:04 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852637#comment-17852637
 ]


Viraj Jasani edited comment on HBASE-28638 at 6/6/24 5:42 PM:
--------------------------------------------------------------

{quote}force scheduling a SCP is the correct way to recover
{quote}
Just to be clear, "force SCP" is nothing but HBCKSCP.
{quote}For me, I think maybe we could add a new feature to hbase Canary? Once a 
canary detects that a regionserver is in trouble, i.e, not responding, or very 
very slow, we could try to kill/restart the region server.
{quote}
That's a good solution, as also discussed on the parent Jira, but I think we can 
track this improvement in a separate Jira rather than tightly coupling it with 
the current issues.

 

For the given issue (or series of issues), what we can do is:
 # Whenever any TRSP gets stuck while making an RPC connection to a remote 
regionserver (for the purpose of region assign or unassign), keep the current 
logic of retrying.
 # However, rather than retrying infinitely, limit the number of retries (and 
make the limit configurable).
 # When retries are exhausted, schedule a force SCP without deleting the 
ephemeral rs ZNode. Whether we schedule a force SCP (HBCKSCP) or a normal SCP 
is something we can discuss. I believe a normal SCP should be fine too.
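
As a rough illustration of the three steps above, here is a minimal sketch of a bounded retry loop that hands off to a crash handler once attempts run out. Every name in it (BoundedDispatchRetrySketch, CrashHandler, maxAttempts, backoffMillis) is hypothetical and chosen for this example; none of it is the actual RSProcedureDispatcher code or an existing HBase configuration key.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch only; these names do not come from the HBase code base. */
public class BoundedDispatchRetrySketch {

  /** Result of dispatching a remote call with a bounded number of attempts. */
  public enum Outcome { SUCCEEDED, RETRIES_EXHAUSTED }

  /** Hypothetical hook standing in for "schedule an SCP for this server". */
  public interface CrashHandler {
    void scheduleServerCrashProcedure(String serverName);
  }

  /**
   * Tries the remote call up to maxAttempts times (the assumed configurable
   * limit). If every attempt fails, asks the crash handler to schedule an SCP
   * instead of retrying forever.
   */
  public static Outcome dispatch(String serverName, Callable<Void> remoteCall,
      int maxAttempts, long backoffMillis, CrashHandler crashHandler)
      throws InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        remoteCall.call();
        return Outcome.SUCCEEDED;
      } catch (InterruptedException e) {
        throw e; // do not swallow interruption
      } catch (Exception e) {
        // Step 1: keep the retry behaviour, but only while attempts remain.
        if (attempt + 1 < maxAttempts) {
          Thread.sleep(backoffMillis);
        }
      }
    }
    // Step 3: retries exhausted, hand the server over to crash handling.
    crashHandler.scheduleServerCrashProcedure(serverName);
    return Outcome.RETRIES_EXHAUSTED;
  }

  public static void main(String[] args) throws InterruptedException {
    Outcome outcome = dispatch("host1,61020,1713411866443",
        () -> { throw new java.io.IOException("Connection closed"); },
        3, 10L,
        server -> System.out.println("would schedule SCP for " + server));
    System.out.println(outcome);
  }
}
```

Whether the handler schedules a force SCP or a normal SCP is deliberately left behind the interface, since that choice is still open in the discussion.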

 

For the canary-based improvement, maybe we can create a separate Jira, if 
that's fine with you [~zhangduo]?



> RSProcedureDispatcher to fail-fast for connection closed errors
> ---------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.5.8
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As per one of the recent incidents, some regions faced a 5+ minute 
> availability drop because, before the active master could initiate SCP for 
> the dead server, some region moves tried to assign regions to the already 
> dead regionserver. Sometimes, due to transient issues, the active master is 
> notified only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
> address=host1:61020 failed on local exception: 
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
> closed, try=0, retrying... {code}
> And as we know, we have infinite retries here, so it kept retrying.
>  
> Eventually, SCP could be initiated only after the active master discovered 
> that the server was dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
> server=host1,61020,1713411866443 for region state=OPENING, 
> location=host1,61020,1713411866443, table=T1, 
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer 
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire outage could be avoided if we fail fast on connection drop 
> errors.
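
To make the fail-fast idea concrete, the sketch below walks an exception's cause chain and reports whether a connection-closed style failure is buried in it, as in the wrapped exception shown in the first log line. It is an assumption-laden illustration: ConnectionClosedLikeException is a hypothetical stand-in defined here so the example is self-contained, and FailFastSketch is not an HBase class.

```java
import java.io.IOException;

/** Hypothetical sketch only; not part of HBase. */
public class FailFastSketch {

  /** Hypothetical stand-in for a "connection closed" exception type. */
  public static class ConnectionClosedLikeException extends IOException {
    public ConnectionClosedLikeException(String message) {
      super(message);
    }
  }

  /** Upper bound on how deep we follow causes, to guard against cycles. */
  private static final int MAX_CAUSE_DEPTH = 16;

  /**
   * Returns true if the throwable, or any of its causes, is a
   * connection-closed style failure. A caller could use this to stop
   * retrying immediately instead of looping.
   */
  public static boolean isConnectionClosed(Throwable t) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (current instanceof ConnectionClosedLikeException) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }

  public static void main(String[] args) {
    Throwable wrapped = new IOException(
        "Call to address=host1:61020 failed on local exception",
        new ConnectionClosedLikeException("Connection closed"));
    System.out.println(isConnectionClosed(wrapped));
  }
}
```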



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
