[ 
https://issues.apache.org/jira/browse/HBASE-25650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Wang updated HBASE-25650:
------------------------------
    Priority: Minor  (was: Major)

> Reduce MTTR for region server
> -----------------------------
>
>                 Key: HBASE-25650
>                 URL: https://issues.apache.org/jira/browse/HBASE-25650
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: master, regionserver
>    Affects Versions: 1.4.13
>            Reporter: Andy Wang
>            Priority: Minor
>
> In some cases in our production environment, the machine that runs a region 
> server was not functioning well (I could not ssh to it, although it responded 
> to ping requests). The Region Server process was still running but could not 
> process client requests. This lasted for more than 30 minutes, until I manually 
> removed that Region Server's znode from ZK. The RS was totally unavailable 
> during that time.
> I guess the Region Server was still heartbeating to ZK, so ZK did not remove 
> the RS's ephemeral node and the master did not detect that the RS was down.
>  
> I think HBase needs better failure detection than just watching for the 
> existence of the ephemeral node created by the RS. 
> One idea that comes to mind is running a failure detector (like [The Φ 
> Accrual Failure Detector 
> (computer.org)|https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNvT2phv])
> on the master, which pings each RS periodically so that the master learns 
> that an RS is down as soon as possible.
>  
> Any ideas?
>  
>  
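To make the proposal above more concrete, here is a minimal sketch of an accrual-style detector that a master-side service might keep per region server. It is not HBase code: the class name `PhiAccrualDetector`, its methods, and the window size are hypothetical, and it assumes the exponential-distribution simplification of the suspicion level (phi = elapsed / (mean interval * ln 10)) rather than the normal-distribution estimate used in the paper. How the master would actually obtain ping responses from an RS is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not part of HBase: tracks heartbeat (ping reply)
// arrival times for one region server and derives a suspicion level phi.
class PhiAccrualDetector {
    private final int windowSize;                         // max intervals kept
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long intervalSumMs = 0;
    private long lastHeartbeatMs = -1;                    // -1: none seen yet

    PhiAccrualDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record a successful ping reply observed at time nowMs.
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            intervalSumMs += interval;
            if (intervalsMs.size() > windowSize) {
                intervalSumMs -= intervalsMs.removeFirst(); // slide the window
            }
        }
        lastHeartbeatMs = nowMs;
    }

    // Suspicion level under the exponential assumption:
    // P(next heartbeat arrives later than elapsed) = exp(-elapsed / mean),
    // so phi = -log10(P) = elapsed / (mean * ln 10).
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0;                                   // not enough samples
        }
        double mean = (double) intervalSumMs / intervalsMs.size();
        if (mean <= 0.0) {
            return 0.0;                                   // avoid division by zero
        }
        double elapsed = nowMs - lastHeartbeatMs;
        return elapsed / (mean * Math.log(10.0));
    }

    // The caller chooses the threshold; higher means fewer false positives
    // but slower detection.
    boolean isSuspected(long nowMs, double threshold) {
        return phi(nowMs) > threshold;
    }
}
```

With one reply per second, a 30-second silence yields phi of roughly 13 in this sketch, so a threshold such as 8 (an arbitrary illustrative value) would flag the RS well before the 30 minutes described above. Note that this only helps if the probe exercises the same path that client requests use; a probe answered by a thread that is still healthy would have the same blind spot as the ZK heartbeat.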



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
