Sergey Shelukhin created HBASE-21744:
----------------------------------------
Summary: timeout for server list refresh calls
Key: HBASE-21744
URL: https://issues.apache.org/jira/browse/HBASE-21744
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin
The root cause is not yet known, but when the cluster is in an overall bad
state we are seeing a case where, after an RS dies and its znode is deleted,
the notification appears to be lost, so the master doesn't detect the failure.
ZK itself appears to be healthy and doesn't report anything unusual.
Once some other change is made to the server list, the master rescans the list
and picks up the change it missed earlier. It might make sense to add a config
that triggers a refresh if one hasn't happened for a while (e.g. 1 minute).
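
A minimal sketch of the staleness check proposed above, in plain Java. This is
not HBase code: the class StaleRefreshTrigger, its methods, and the 60-second
threshold are all hypothetical names and values chosen for illustration, and the
refresh callback stands in for whatever the master does to rescan the server
list.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not an HBase class): forces a server-list rescan
 * when no rescan has happened for at least maxAgeMillis.
 */
class StaleRefreshTrigger {
  private final long maxAgeMillis;        // assumed config value, e.g. 60_000 for 1 minute
  private final LongSupplier clock;       // injected so the logic can be tested without sleeping
  private final Runnable refresh;         // stand-in for the master's rescan of the server list
  private final AtomicLong lastRefreshMillis;

  StaleRefreshTrigger(long maxAgeMillis, LongSupplier clock, Runnable refresh) {
    this.maxAgeMillis = maxAgeMillis;
    this.clock = clock;
    this.refresh = refresh;
    this.lastRefreshMillis = new AtomicLong(clock.getAsLong());
  }

  /** Call from the normal notification path whenever a rescan has just run. */
  void markRefreshed() {
    lastRefreshMillis.set(clock.getAsLong());
  }

  /** Call periodically; runs the rescan and returns true only if the last one is too old. */
  boolean checkAndRefresh() {
    long now = clock.getAsLong();
    if (now - lastRefreshMillis.get() < maxAgeMillis) {
      return false; // a rescan happened recently enough, nothing to do
    }
    refresh.run();
    lastRefreshMillis.set(now);
    return true;
  }
}
```

In a real master this check could be driven by a periodic task (for example a
java.util.concurrent.ScheduledExecutorService), with markRefreshed() called from
the existing watcher path so that the forced rescan only fires when
notifications have been quiet for longer than the configured interval.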