Andy Wang created HBASE-25650:
---------------------------------

             Summary: Reduce MTTR for region server
                 Key: HBASE-25650
                 URL: https://issues.apache.org/jira/browse/HBASE-25650
             Project: HBase
          Issue Type: Brainstorming
          Components: master, regionserver
    Affects Versions: 1.4.13
            Reporter: Andy Wang


In some cases in our production environment, the machine that runs a region 
server stops functioning properly (I could not ssh to that machine, but it 
still responded to ping requests). The Region Server process was still running 
but could not process client requests. This lasted for more than 30 minutes, 
until I manually removed the znode of that Region Server from ZK. The RS was 
completely unavailable during that time.

I guess the Region Server still heartbeats to ZK, so ZK does not remove the 
ephemeral node of the RS and the master never detects that this RS is down.

 

I think HBase needs better failure detection than just watching for the 
existence of the ephemeral node created by the RS.

One thing that comes to mind is running a failure detection service (like 
[The Φ Accrual Failure Detector (computer.org)|https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNvT2phv]) 
on the master, which pings each RS periodically so that the master learns that 
an RS is down as soon as possible.
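
To make the idea more concrete, here is a rough sketch of what such a detector 
could look like on the master side. It is only an illustration, not existing 
HBase code: the class name PhiAccrualDetector, its methods, and the threshold 
of 8 are all made up for this example. To keep it short it models ping 
inter-arrival times with an exponential distribution (so phi = elapsed / (mean 
* ln 10)) instead of the normal distribution used in the paper, and it assumes 
the master calls heartbeat() each time an RS answers a ping.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-RS suspicion tracker; not part of HBase. */
class PhiAccrualDetector {
    private final int windowSize;                       // max intervals kept
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long sumMs = 0;                             // sum of kept intervals
    private long lastHeartbeatMs = -1;                  // -1 = nothing seen yet

    PhiAccrualDetector(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Record a successful ping reply from the RS at time nowMs. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            sumMs += interval;
            if (intervalsMs.size() > windowSize) {
                sumMs -= intervalsMs.removeFirst();     // slide the window
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /**
     * Suspicion level at time nowMs. Under the exponential assumption,
     * P(next reply arrives later than elapsed) = exp(-elapsed / mean),
     * and phi = -log10 of that probability.
     */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0;                                 // not enough history
        }
        double mean = (double) sumMs / intervalsMs.size();
        if (mean <= 0.0) {
            return 0.0;                                 // degenerate history
        }
        double elapsed = nowMs - lastHeartbeatMs;
        return elapsed / (mean * Math.log(10.0));
    }

    /** True once the suspicion level reaches the given threshold. */
    boolean isSuspected(long nowMs, double threshold) {
        return phi(nowMs) >= threshold;
    }
}
```

With replies arriving every second and a threshold of 8, this sketch would 
suspect the RS after roughly 18.4 seconds of silence (8 * ln 10 * 1 s). Note 
that this only helps if the ping exercises the same path that client requests 
use; a ping answered by a thread that is still healthy would have the same 
blind spot as the ZK heartbeat.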

 

Any ideas?

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
