[ 
https://issues.apache.org/jira/browse/HBASE-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-23206:
----------------------------------------
    Description: 
We have faced a few production issues where the reliability of the ZooKeeper 
quorum serving the cluster has not been as robust as expected. The most recent 
one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These can be 
mitigated by a ZK server configuration change but the incidents suggest it may 
be worth thinking about how to be less reliant on the service provided by a 
single ZK quorum instance. 

A solution would be holistic with several parts:
- HBASE-18095 to get ZK dependencies out of the client
- Related HBase replication improvements to track peer and position state in 
HBase tables instead of znodes
- This brainstorming...

For this part, RecoverableZooKeeper (RZK) might be taught how to speak to two 
separate ZK quorums redundantly, and continue to offer service even if one of 
them is temporarily unable to provide service. 

  was:
We have faced a few production issues where the reliability of the ZooKeeper 
quorum serving the cluster has not been as robust as expected. The most recent 
one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These can be 
mitigated by a ZK server configuration change but the incidents suggest it may 
be worth thinking about how to be less reliant on the service provided by a 
single ZK quorum instance. 

A solution would be holistic with several parts:
- HBASE-18095 to get ZK dependencies out of the client
- Related HBase replication improvements to track peer and position state in 
HBase tables instead of znodes
- This brainstorming...

For this part, we could consider the possibility that RecoverableZooKeeper 
(RZK) might be taught how to speak to two separate ZK quorums redundantly, and 
continue to offer service even if one of them is temporarily unable to provide 
service. 


> ZK quorum redundancy with failover in RZK
> -----------------------------------------
>
>                 Key: HBASE-23206
>                 URL: https://issues.apache.org/jira/browse/HBASE-23206
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> We have faced a few production issues where the reliability of the ZooKeeper 
> quorum serving the cluster has not been as robust as expected. The most 
> recent one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These 
> can be mitigated by a ZK server configuration change but the incidents 
> suggest it may be worth thinking about how to be less reliant on the service 
> provided by a single ZK quorum instance. 
> A solution would be holistic with several parts:
> - HBASE-18095 to get ZK dependencies out of the client
> - Related HBase replication improvements to track peer and position state in 
> HBase tables instead of znodes
> - This brainstorming...
> For this part, RecoverableZooKeeper (RZK) might be taught how to speak to two 
> separate ZK quorums redundantly, and continue to offer service even if one of 
> them is temporarily unable to provide service. 
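
To make the RZK idea above concrete, here is a minimal sketch of what 
"continue to offer service even if one quorum is unavailable" might look like 
for reads. It is an illustration only: QuorumReader and FailoverQuorumReader 
are hypothetical names, not part of HBase or ZooKeeper, and the real 
RecoverableZooKeeper API is not used. An IOException stands in for whatever 
error a quorum would report when it cannot serve the request.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal view of one ZK quorum; not the real RecoverableZooKeeper API. */
interface QuorumReader {
    byte[] getData(String path) throws IOException;
}

/** Tries each configured quorum in order and returns the first successful read. */
final class FailoverQuorumReader implements QuorumReader {
    private final List<QuorumReader> quorums;

    FailoverQuorumReader(List<QuorumReader> quorums) {
        if (quorums.isEmpty()) {
            throw new IllegalArgumentException("at least one quorum is required");
        }
        // Defensive copy so later changes to the caller's list do not affect us.
        this.quorums = new ArrayList<>(quorums);
    }

    @Override
    public byte[] getData(String path) throws IOException {
        IOException last = null;
        for (QuorumReader quorum : quorums) {
            try {
                return quorum.getData(path);
            } catch (IOException e) {
                // Assumption: an IOException means this quorum cannot serve
                // right now, so remember the error and try the next one.
                last = e;
            }
        }
        // Every quorum failed; surface the most recent cause to the caller.
        throw new IOException("all quorums failed for " + path, last);
    }
}
```

The sketch covers reads only. How writes, watches, and ephemeral nodes would 
stay consistent across two quorums is exactly the open question this 
brainstorming issue raises, and the code above does not attempt to answer it.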



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
