[ https://issues.apache.org/jira/browse/HBASE-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Stack resolved HBASE-8748.
----------------------------------
Resolution: Won't Fix
Stale. Context is different now.
> Be able to accommodate zookeeper going away for a minute or two -- or more
> --------------------------------------------------------------------------
>
> Key: HBASE-8748
> URL: https://issues.apache.org/jira/browse/HBASE-8748
> Project: HBase
> Issue Type: Brainstorming
> Components: Zookeeper
> Reporter: Michael Stack
> Priority: Major
>
> I was talking w/ Christophe Taton yesterday and he asked what happens if
> zookeeper goes away for a minute or two -- say a network or ensemble hiccup
> of some type.
> Unless the ensemble comes back inside the zk session timeout, the cluster
> will go down.
> To my knowledge, zk has hiccuped a few times. There was the bug where
> sequence numbers rolled around the top causing the ensemble to blip (fixed in
> a newer zk). There was another event where <speculation>some combination of
> a leader election and accumulated log files (>100k)</speculation> caused the
> ensemble to blip at SU.
> At FB apparently the zk session timeout is way up -- > 5 minutes -- in case a
> top-of-the-rack switch reboots, partitioning the network and separating nodes
> from the zk ensemble. Rather than rely on the presence of ephemeral nodes,
> they depend on heartbeats to determine whether a regionserver is present (w/
> some smarts so that if all members of a rack disappear at the same time, it
> is assumed unlikely that they all crashed at the same time).
> I am stating the obvious I know but the base presumption that zk will just
> always be there is lazy on our part and we should not be acting as though it
> were.
> Marking this a brainstorming issue because it will need a bit of
> discussion/design to undo our current presumption.
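
For discussion purposes, the heartbeat-based approach described in the quoted issue might be sketched roughly as below. This is only an illustration under stated assumptions, not HBase or FB code: the class `HeartbeatTracker`, its methods, the timeout value, and the rule "every member of a multi-server rack is silent, so suspect a partition rather than a crash" are all hypothetical names and choices made up for the example.

```python
from collections import defaultdict


class HeartbeatTracker:
    """Hypothetical tracker: last heartbeat per regionserver, grouped by rack."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # silence longer than this counts as missing
        self.last_seen = {}         # server -> timestamp of last heartbeat
        self.rack_of = {}           # server -> rack name

    def heartbeat(self, server, rack, now):
        """Record a heartbeat from `server` (located in `rack`) at time `now`."""
        self.last_seen[server] = now
        self.rack_of[server] = rack

    def classify(self, now):
        """Return (dead_servers, suspect_racks) as two sets.

        Assumption for this sketch: if every server in a rack with more than
        one member is silent, a network partition is more likely than a
        simultaneous crash, so the rack is flagged instead of its servers.
        """
        by_rack = defaultdict(list)
        for server, rack in self.rack_of.items():
            by_rack[rack].append(server)

        dead, suspect_racks = set(), set()
        for rack, servers in by_rack.items():
            silent = [s for s in servers
                      if now - self.last_seen[s] > self.timeout_s]
            if len(servers) > 1 and len(silent) == len(servers):
                suspect_racks.add(rack)   # whole rack vanished together
            else:
                dead.update(silent)       # isolated silence: treat as crashed
        return dead, suspect_racks


tracker = HeartbeatTracker(timeout_s=30)
for name, rack in [("rs1", "rackA"), ("rs2", "rackA"),
                   ("rs3", "rackB"), ("rs4", "rackB")]:
    tracker.heartbeat(name, rack, now=0)
tracker.heartbeat("rs3", "rackB", now=50)  # only rs3 keeps reporting

dead, suspect = tracker.classify(now=60)
print(sorted(dead), sorted(suspect))
```

In this example rackA goes silent as a whole and is flagged as a suspected partition, while rs4 alone stops reporting in rackB and is treated as crashed. What the master should then do with a suspect rack (wait, fence, reassign regions) is exactly the kind of design question the issue leaves open.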