[ 
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852560#comment-17852560
 ] 

Viraj Jasani commented on HBASE-28638:
--------------------------------------

{quote}I still think the root problem here is why we can only schedule SCP 
after 5 minutes...
{quote}
I will try to find the ZK logs, but this incident happened some time ago and we 
might soon lose some of those logs, so there is no guarantee that we will be 
able to find the root cause of the ZooKeeper issue.

In the above case, we also don't see any regionserver stop or abort logs. All 
of a sudden, there are no logs from the regionserver for about 5 minutes (the 
OS was being upgraded during this period). I have been thinking about this 
because it's not certain what exactly happened. Was the server killed or 
abruptly stopped in a way that somehow kept its ZooKeeper connection stable 
while it was otherwise unable to serve any client requests? Or was it killed 
and the ZooKeeper connection lost as well, but the ZNode removal watcher 
notification reached the active master late due to some bug? These are open 
questions, and I am not sure we will have access to the old logs.

 

1. However, if we think about this whole situation, there is one thing we 
should prevent: a region that is not hosted on the unstable regionserver should 
not suffer downtime just because the active master chose the unstable server 
as its destination. We need some level of protection against this kind of 
availability loss.

2. On the other hand, the regions hosted by the unstable regionserver would 
suffer availability loss until the active master can schedule SCP, which 
depends on the reliability of ZooKeeper watcher notifications. Some fine 
tuning could probably make this more stable, but any bug in ZooKeeper could 
still lead to delayed notifications.

While the second concern depends on ZooKeeper connection reliability, 
the first one should be the active master's responsibility IMO.
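
To make the first point concrete, here is a minimal sketch of what a fail-fast 
decision in the dispatcher's retry path could look like. This is only an 
illustration under stated assumptions, not the actual RSProcedureDispatcher 
code: the class FailFastSketch, the Decision enum, the decide() method and the 
failFastLimit parameter are all hypothetical names, and the nested 
ConnectionClosedException is a local stand-in for 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException so that the 
example compiles on its own.

```java
import java.io.IOException;

public class FailFastSketch {

  /** Local stand-in for HBase's ConnectionClosedException (assumption, see lead-in). */
  static class ConnectionClosedException extends IOException {
    ConnectionClosedException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical outcome of a failed remote call from the master to a regionserver. */
  enum Decision { RETRY, FAIL_FAST }

  /** Returns true if the error or any of its causes is a connection-closed error. */
  static boolean isConnectionClosed(Throwable error) {
    for (Throwable cur = error; cur != null; cur = cur.getCause()) {
      if (cur instanceof ConnectionClosedException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Decides whether to keep retrying against the same target server.
   *
   * @param error         the failure seen for this attempt
   * @param attempt       zero-based attempt counter (the "try=0" in the log above)
   * @param failFastLimit hypothetical number of attempts tolerated before giving up
   */
  static Decision decide(Throwable error, int attempt, int failFastLimit) {
    if (isConnectionClosed(error) && attempt >= failFastLimit) {
      // Give up on this server so the region can be assigned elsewhere
      // instead of retrying forever until SCP is scheduled.
      return Decision.FAIL_FAST;
    }
    return Decision.RETRY;
  }

  public static void main(String[] args) {
    IOException wrapped = new IOException(
        "Call to address=host1:61020 failed on local exception",
        new ConnectionClosedException("Connection closed"));

    System.out.println(decide(wrapped, 0, 3));                      // RETRY
    System.out.println(decide(wrapped, 3, 3));                      // FAIL_FAST
    System.out.println(decide(new IOException("timeout"), 10, 3));  // RETRY
  }
}
```

The sketch walks the cause chain because, in the log quoted below, the 
ConnectionClosedException appears wrapped inside the "failed on local 
exception" error. Other errors keep the existing retry behaviour, so only the 
"target is very likely dead" case is cut short.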

 

WDYT [~zhangduo] [~apurtell]?

> RSProcedureDispatcher to fail-fast for connection closed errors
> ---------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.5.8
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As per one of the recent incidents, some regions faced a 5+ minute 
> availability drop because, before the active master could initiate SCP for 
> the dead server, some region moves tried to assign regions to the already 
> dead regionserver. Sometimes, due to transient issues, the active master 
> gets notified only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
> address=host1:61020 failed on local exception: 
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
> closed, try=0, retrying... {code}
> And as we know, we have infinite retries here, so it kept going.
>  
> Eventually, SCP could be initiated only after active master discovered the 
> server as dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
> server=host1,61020,1713411866443 for region state=OPENING, 
> location=host1,61020,1713411866443, table=T1, 
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer 
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire duration of outage could be avoided if we can fail-fast for 
> connection drop errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
