stack created HBASE-8748:
----------------------------

             Summary: Be able to accommodate zookeeper going away for a minute 
or two -- or more
                 Key: HBASE-8748
                 URL: https://issues.apache.org/jira/browse/HBASE-8748
             Project: HBase
          Issue Type: Brainstorming
          Components: Zookeeper
            Reporter: stack


I was talking w/ Christophe Taton yesterday and he asked what happens if 
zookeeper goes away for a minute or two -- say a network or ensemble hiccup of 
some type.

Unless the ensemble comes back inside the zk session timeout, the cluster will 
go down.

To my knowledge, zk has hiccuped a few times.  There was the bug where sequence 
numbers rolled over at the top of their range, causing the ensemble to blip 
(fixed in a newer zk).  There was another event where <speculation>some 
combination of a leader election and accumulated log files 
(>100k)</speculation> caused the ensemble to blip at SU.

At FB, apparently, the zk session timeout is way up -- > 5 minutes -- in case a 
top-of-the-rack switch reboots and partitions the network, separating nodes 
from the zk ensemble.  Rather than rely on the presence of ephemeral nodes, 
they depend on heartbeats to determine whether a regionserver is present (w/ 
some smarts so that if all members of a rack disappear at the same time, the 
assumption is that they did not all crash at the same time).
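
To make the heartbeat idea above concrete, here is a minimal sketch of 
rack-aware liveness classification.  It is not FB's or HBase's actual 
implementation; the function name, the status labels, and the rule that a rack 
needs more than one member before a whole-rack silence counts as a partition 
are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_servers(last_heartbeat, rack_of, now, timeout_s):
    """Classify servers as 'alive', 'dead', or 'rack_unreachable'.

    Hypothetical helper, for illustration only.
    last_heartbeat: server name -> time of last heartbeat (seconds)
    rack_of:        server name -> rack identifier
    """
    # A server is "silent" if its last heartbeat is older than the timeout.
    silent = {s for s, t in last_heartbeat.items() if now - t > timeout_s}

    # Group the known servers by rack.
    by_rack = defaultdict(set)
    for server in last_heartbeat:
        by_rack[rack_of[server]].add(server)

    status = {}
    for members in by_rack.values():
        # Assumption: if every member of a multi-server rack went silent,
        # a switch or network problem is more likely than simultaneous crashes.
        whole_rack_silent = len(members) > 1 and members <= silent
        for server in members:
            if server not in silent:
                status[server] = "alive"
            elif whole_rack_silent:
                status[server] = "rack_unreachable"
            else:
                status[server] = "dead"
    return status
```

A master using something like this would start recovery only for servers 
marked 'dead' and hold off on those marked 'rack_unreachable' until the rack 
returns or some longer, separate deadline passes.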

I am stating the obvious, I know, but the base presumption that zk will just 
always be there is lazy on our part and we should not be acting as though it 
will be.

Marking this a brainstorming issue because it will need a bit of 
discussion/design to undo our current presumption.
