[
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852560#comment-17852560
]
Viraj Jasani commented on HBASE-28638:
--------------------------------------
{quote}I still think the root problem here is why we can only schedule SCP
after 5 minutes...
{quote}
I will try to find the ZooKeeper logs, but this incident happened some time ago and we
might soon lose some of those logs, so there is no guarantee that we will be able
to find the root cause of the ZooKeeper issue.
In the above case, we also don't see any regionserver stop or abort logs. All
of a sudden, the regionserver produced no logs for 5 minutes (the OS was being
upgraded during this period). I have been thinking about this because it is not
certain what exactly happened. Was the server killed or abruptly stopped in a
way that somehow kept the ZooKeeper connection stable while the server itself
could not serve any client requests? Or was it killed, with the ZooKeeper
connection also lost, but the ZNode removal watcher notification reached the
active master late due to some bug? These are some of the open questions, but
I am not sure whether we will have access to the old logs.
1. However, if we think about this whole situation, there is one thing we
should prevent: any region that is not hosted on the unstable regionserver
should not suffer downtime just because the active master decided to keep the
unstable server as its destination. We need some level of protection against
this kind of availability loss.
2. On the other hand, the regions hosted by the unstable regionserver would
suffer availability loss until the active master can schedule the SCP, which
depends on the reliability of ZooKeeper watcher notifications. Some fine-tuning
could probably make this more stable, but any code bug in ZooKeeper might also
lead to delayed notifications.
While the second concern depends on ZooKeeper connection reliability, the
first one should be the active master's responsibility IMO.
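To make the first point concrete, here is a rough sketch of the kind of fail-fast decision the dispatcher could make. This is only an illustration and not the actual patch: the class FailFastSketch, the Decision enum, the decide() method and the limit of 3 attempts are all hypothetical, and the nested ConnectionClosedException is a local stand-in for org.apache.hadoop.hbase.exceptions.ConnectionClosedException so that the snippet compiles without HBase on the classpath.

```java
import java.io.IOException;

// Hypothetical sketch: decide whether a failed remote call should be retried
// or given up on quickly. Not the real RSProcedureDispatcher code.
class FailFastSketch {

    // Local stand-in for org.apache.hadoop.hbase.exceptions.ConnectionClosedException.
    static class ConnectionClosedException extends IOException {
        ConnectionClosedException(String message) {
            super(message);
        }
    }

    enum Decision { RETRY, FAIL_FAST }

    // Assumed limit, chosen only for illustration.
    static final int MAX_CONNECTION_CLOSED_ATTEMPTS = 3;

    // 'attempt' is the zero-based index of the try that just failed,
    // matching the "try=0" counter seen in the dispatcher log line.
    static Decision decide(Throwable error, int attempt) {
        // Walk the cause chain, since the exception may arrive wrapped.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectionClosedException) {
                return attempt + 1 >= MAX_CONNECTION_CLOSED_ATTEMPTS
                    ? Decision.FAIL_FAST
                    : Decision.RETRY;
            }
        }
        // Any other error keeps the existing retry behaviour in this sketch.
        return Decision.RETRY;
    }

    public static void main(String[] args) {
        Throwable wrapped =
            new IOException("Call failed", new ConnectionClosedException("Connection closed"));
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("try=" + attempt + " -> " + decide(wrapped, attempt));
        }
    }
}
```

With something along these lines, a region move targeting the unstable server would fail after a bounded number of connection-closed errors, and the master could pick another destination instead of retrying until the SCP is scheduled.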
WDYT [~zhangduo] [~apurtell]?
> RSProcedureDispatcher to fail-fast for connection closed errors
> ---------------------------------------------------------------
>
> Key: HBASE-28638
> URL: https://issues.apache.org/jira/browse/HBASE-28638
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.5.8
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As per one of the recent incidents, some regions faced a 5+ minute
> availability drop because, before the active master could initiate the SCP
> for the dead server, some region moves tried to assign regions to the already
> dead regionserver. Sometimes, due to transient issues, the active master is
> notified of the dead server only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790]
> procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed
> due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to
> address=host1:61020 failed on local exception:
> org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection
> closed, try=0, retrying... {code}
> And as we know, we have infinite retries here, so it kept going.
>
> Eventually, SCP could be initiated only after active master discovered the
> server as dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer -
> Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO [RegionServerTracker-0]
> master.RegionServerTracker - RegionServer ephemeral node deleted, processing
> expiration [host1,61020,1713411866443] {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833]
> assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691,
> state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51,
> server=host1,61020,1713411866443 for region state=OPENING,
> location=host1,61020,1713411866443, table=T1,
> region=5cafbe54d5685acc6c4866758e67fd51, targetServer
> host1,61020,1713411866443 is dead, SCP will interrupt us, give up {code}
> This entire outage could be avoided if we fail fast on connection drop
> errors.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)