[
https://issues.apache.org/jira/browse/HBASE-25650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Wang updated HBASE-25650:
------------------------------
Priority: Minor (was: Major)
> Reduce MTTR for region server
> -----------------------------
>
> Key: HBASE-25650
> URL: https://issues.apache.org/jira/browse/HBASE-25650
> Project: HBase
> Issue Type: Brainstorming
> Components: master, regionserver
> Affects Versions: 1.4.13
> Reporter: Andy Wang
> Priority: Minor
>
> In some cases in our production environment, the machine that runs a region
> server stops functioning properly (I could not ssh to the machine, but it
> still responded to ping requests). The Region Server process was still
> running but could not process client requests. This lasted for more than 30
> minutes, until I manually removed the znode of that Region Server from ZK.
> The RS was totally unavailable during that time.
> I guess the Region Server was still heartbeating to ZK, so the ephemeral node
> of the RS was not removed by ZK and the master did not detect that the RS was
> down.
>
> I think HBase needs better failure detection than just watching for the
> existence of the ephemeral node created by the RS.
> One idea that comes to mind is running a failure detection service (like [The Φ
> Accrual Failure Detector
> (computer.org)|https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNvT2phv])
> on the master which pings each RS periodically, so that the master learns
> that an RS is down as soon as possible.
>
> Any ideas?
>
>
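To make the Φ accrual idea in the description concrete, here is a minimal
sketch in Python. It is not HBase code and not a proposed patch. The class
name, the window size, the minimum standard deviation and the threshold
mentioned afterwards are all assumptions chosen for illustration. It models
heartbeat inter-arrival times with a normal distribution and reports
phi = -log10(P(next heartbeat arrives later than now)).

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Hypothetical per-RS detector: turns heartbeat timing into a suspicion level."""

    def __init__(self, window_size=100, min_std_seconds=0.1):
        # Sliding window of recent inter-arrival intervals, in seconds.
        self.intervals = deque(maxlen=window_size)
        # Floor on the standard deviation so perfectly regular heartbeats
        # do not produce a division by zero (assumed value).
        self.min_std = min_std_seconds
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat (or successful ping reply) received at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level at time `now`; higher means more suspicious."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything yet
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        variance = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(variance), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat would still arrive after `elapsed`
        # seconds, under a normal model of the inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return float("inf")  # probability underflowed to zero
        return -math.log10(p_later)
```

The master would call heartbeat() whenever an RS answers a ping, and
periodically compare phi() against a threshold (say 8, an assumed value) to
decide whether to start recovery for that RS. In the scenario above, the
ping would have to exercise the RS request path rather than the ZK session,
otherwise it would keep succeeding just like the ZK heartbeat did.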