[ https://issues.apache.org/jira/browse/HBASE-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell updated HBASE-23206:
----------------------------------------
Description:
We have faced a few production issues where the reliability of the ZooKeeper
quorum serving the cluster has not been as robust as expected. The most recent
one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These can be
mitigated by a ZK server configuration change but the incidents suggest it may
be worth thinking about how to be less reliant on the service provided by a
single ZK quorum instance.
A solution would be holistic with several parts:
- HBASE-18095 to get ZK dependencies out of the client
- Related HBase replication improvements to track peer and position state in
HBase tables instead of znodes
- This brainstorming...
For this part, RecoverableZooKeeper (RZK) might be taught how to speak to two
separate ZK quorums redundantly, and continue to offer service even if one of
them is temporarily unable to provide service.
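As a rough illustration of the failover idea above, here is a minimal sketch in
Java. It is not RZK code: FailoverQuorumClient, Quorum, and
QuorumUnavailableException are hypothetical names invented for this example,
and Quorum is a stand-in for whatever subset of the ZooKeeper client RZK wraps.
It covers reads only. Keeping writes consistent across two quorums is the hard
part and is not addressed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in HBase or ZooKeeper.
final class FailoverQuorumClient {

  /** Stand-in for the subset of ZooKeeper operations that RZK wraps. */
  interface Quorum {
    byte[] getData(String path) throws QuorumUnavailableException;
  }

  /** Signals that a quorum cannot currently serve requests. */
  static final class QuorumUnavailableException extends Exception {
    QuorumUnavailableException(String message) {
      super(message);
    }
  }

  private final List<Quorum> quorums;
  // Index of the quorum that answered most recently; tried first next time.
  private int preferred = 0;

  FailoverQuorumClient(List<Quorum> quorums) {
    if (quorums.isEmpty()) {
      throw new IllegalArgumentException("need at least one quorum");
    }
    this.quorums = new ArrayList<>(quorums);
  }

  /** Reads from the preferred quorum, falling back to the others in order. */
  synchronized byte[] getData(String path) throws QuorumUnavailableException {
    QuorumUnavailableException last = null;
    for (int attempt = 0; attempt < quorums.size(); attempt++) {
      int index = (preferred + attempt) % quorums.size();
      try {
        byte[] data = quorums.get(index).getData(path);
        preferred = index; // stick with the quorum that is working
        return data;
      } catch (QuorumUnavailableException e) {
        last = e; // remember the failure and try the next quorum
      }
    }
    // Every quorum failed; surface the most recent failure to the caller.
    throw last;
  }
}
```

With two quorums configured, a read fails only when both are unavailable at
the same time. Remembering the last quorum that answered avoids paying for a
failed attempt on every call during an outage.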
was:
We have faced a few production issues where the reliability of the ZooKeeper
quorum serving the cluster has not been as robust as expected. The most recent
one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These can be
mitigated by a ZK server configuration change but the incidents suggest it may
be worth thinking about how to be less reliant on the service provided by a
single ZK quorum instance.
A solution would be holistic with several parts:
- HBASE-18095 to get ZK dependencies out of the client
- Related HBase replication improvements to track peer and position state in
HBase tables instead of znodes
- This brainstorming...
For this part, we could consider the possibility that RecoverableZooKeeper
(RZK) might be taught how to speak to two separate ZK quorum redundantly, and
continue to offer service even if one of them is temporarily unable to provide
service.
> ZK quorum redundancy with failover in RZK
> -----------------------------------------
>
> Key: HBASE-23206
> URL: https://issues.apache.org/jira/browse/HBASE-23206
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> We have faced a few production issues where the reliability of the ZooKeeper
> quorum serving the cluster has not been as robust as expected. The most
> recent one was essentially ZOOKEEPER-2164 (and related: ZOOKEEPER-900). These
> can be mitigated by a ZK server configuration change but the incidents
> suggest it may be worth thinking about how to be less reliant on the service
> provided by a single ZK quorum instance.
> A solution would be holistic with several parts:
> - HBASE-18095 to get ZK dependencies out of the client
> - Related HBase replication improvements to track peer and position state in
> HBase tables instead of znodes
> - This brainstorming...
> For this part, RecoverableZooKeeper (RZK) might be taught how to speak to two
> separate ZK quorums redundantly, and continue to offer service even if one of
> them is temporarily unable to provide service.