stack created HBASE-8748:
----------------------------
Summary: Be able to accommodate zookeeper going away for a minute
or two -- or more
Key: HBASE-8748
URL: https://issues.apache.org/jira/browse/HBASE-8748
Project: HBase
Issue Type: Brainstorming
Components: Zookeeper
Reporter: stack
I was talking w/ Christophe Taton yesterday and he asked what happens if
zookeeper goes away for a minute or two, say because of a network or ensemble
hiccup of some type.
Unless the ensemble comes back inside the zk session timeout, the cluster will
go down.
To my knowledge, zk has hiccuped a few times. There was the bug where sequence
numbers rolled around the top causing the ensemble to blip (fixed in a newer
zk). There was another event where <speculation>some combination of a leader
election and accumulated log files (>100k)</speculation> caused the ensemble
to blip at SU.
At FB, apparently, the zk session timeout is set way up (> 5 minutes) in case a
top-of-the-rack switch reboots and partitions the network, separating nodes from
the zk ensemble. Rather than rely on the presence of ephemeral nodes, they
depend on heartbeats to determine whether a regionserver is present (w/ some
smarts so that if all members of a rack disappear at the same time, they are
not all declared dead, since it is not likely they all crashed at the same
time).
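To make the rack "smarts" concrete, here is a rough sketch of how such a check
could look. This is not FB's code and not anything in HBase; the function name
classify_silent_servers, its parameters, and the rule "a whole rack of more
than one server going silent is treated as a suspected partition" are all
assumptions made up for illustration.

```python
from collections import defaultdict


def classify_silent_servers(last_heartbeat, rack_of, now, timeout_s):
    """Split silent servers into (dead, suspected_partition) sets.

    Hypothetical helper, for illustration only.
    last_heartbeat: server name -> time of last heartbeat, in seconds
    rack_of:        server name -> rack id
    now, timeout_s: current time and heartbeat timeout, in seconds
    """
    servers_by_rack = defaultdict(set)
    silent_by_rack = defaultdict(set)
    for server, seen_at in last_heartbeat.items():
        rack = rack_of[server]
        servers_by_rack[rack].add(server)
        if now - seen_at > timeout_s:
            silent_by_rack[rack].add(server)

    dead, suspected_partition = set(), set()
    for rack, silent in silent_by_rack.items():
        # Assumption: if every server on a multi-server rack went silent,
        # a switch reboot is more likely than simultaneous crashes, so we
        # hold off on declaring those servers dead.
        whole_rack_silent = len(silent) > 1 and silent == servers_by_rack[rack]
        if whole_rack_silent:
            suspected_partition |= silent
        else:
            dead |= silent
    return dead, suspected_partition
```

A caller would presumably reassign regions only for the servers in the first
set, and give the second set a longer grace period before acting.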
I am stating the obvious, I know, but the base presumption that zk will just
always be there is lazy on our part and we should not be acting as though it
will be.
Marking this a brainstorming issue because it will need a bit of
discussion/design to undo our current presumption.