[
https://issues.apache.org/jira/browse/HBASE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900208#comment-13900208
]
Feng Honghua commented on HBASE-10296:
--------------------------------------
bq.Zookeeper is used also by HDFS, Kafka, Storm as well as several other
systems. Is it realistic (or desirable) to assume it would go away (from an
operations standpoint)? (With raft-go there's etcd, for example).
Not quite. For applications that just need simple, reliable storage for a small
amount of configuration or metadata-like data with sparse access, Zookeeper still
has advantages over a raft-based solution:
# Economical: Zookeeper can be shared among a large number of applications with
such simple storage requirements, whereas with a raft-based solution each
application needs to allocate its own separate 3-5 nodes just for replication.
# Simple: Application code stays simple by just calling the Zookeeper API to
create/read/write znodes and data, while a raft-based solution requires more
complex glue code between the application and the raft library, such as applying
committed raft log entries to an in-memory data structure with
application-specific meaning, taking snapshots, truncating the log, etc. (see the
sketch after this list)
# Convenient: Zookeeper's tree-like hierarchical structure for organizing data
and its watch/notify mechanism are convenient for applications to represent data
and organize code, as long as watch/notify is not used to implement
state-machine-like logic with the 'process A changes a znode, process B watches
that znode and then reads its value to drive its state machine' pattern
In short, a raft-based solution is somewhat of an overkill for applications with
such simple, small, sparsely-accessed storage requirements.
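To make the 'Simple' point concrete, here is a rough sketch of the two styles.
The Zookeeper half uses the real org.apache.zookeeper client API; the raft half
is only a hypothetical callback interface (the names are assumptions, every raft
library spells them differently) showing the kind of apply/snapshot/truncate
plumbing the application itself has to provide.

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleStorageSketch {

  // Zookeeper style: replication, durability and leader election are the
  // ensemble's problem; the application only creates/reads/writes znodes.
  static void zookeeperStyle() throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, null);
    zk.create("/myapp/config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);  // write
    byte[] value = zk.getData("/myapp/config", false, null);  // read
    zk.setData("/myapp/config", "v2".getBytes(), -1);         // update
  }

  // Raft-library style (hypothetical interface): the application must turn
  // committed log entries into in-memory state itself, and also handle
  // snapshotting and log truncation so the raft log does not grow forever.
  interface RaftStateMachine {
    void apply(byte[] committedEntry);    // replay one committed log entry
    byte[] takeSnapshot();                // serialize current in-memory state
    void restoreFromSnapshot(byte[] s);   // rebuild state on restart/catch-up
  }
}
{code}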
bq.Will a library based approach simplify the code overall or make it easier to
understand? it seems that it will make at least some parts more complex. What
aspects of the system will be improved by the lower latencies? I'm not really
clear on the faster master failover benefit. Will this improve region
reassignment in a manner that could not be achieved without it?
For HMaster, the raft-based approach has the following benefits:
# For the assign (split/merge) state machine logic, the raft-based approach
eliminates the potential for state inconsistency. HMaster's current
implementation suffers from two facts which can result in consistency issues:
1) Zookeeper's watch/notify mechanism is used to drive the assign state machine;
2) assign state is stored in multiple places (master's memory, Zookeeper), so
there is a constant headache of keeping the data consistent across those
different places (see the sketch after this list)
# Better master failover performance. A new master can immediately act as the
active master after the previous active one dies, without first reading from
external storage to rebuild its in-memory state (HBase's current approach) or
querying regionservers to rebuild its in-memory view of the cluster (Bigtable's
approach; personally I think Bigtable's master startup code must be even more
complicated than HBase's, since it needs to reason out the correct 'cluster
state' from regionserver responses, not to mention that regionservers can fail
during the master startup process...)
# Better whole-cluster restart performance. For a cluster with a large number of
regions (say 10K-100K), the master needs to assign all regions during a cluster
restart, which results in very frequent access to Zookeeper. Because the master
uses only a single IO thread and a single event thread to communicate with
Zookeeper, that interaction can become an obvious bottleneck for the cluster
restart, while a raft-based approach can perform much better here.
# Simpler deployment. With the raft-based approach an HBase deployment is
'3 masters + n regionservers', while with the Zookeeper solution it is
'3 Zookeeper + 2+ masters + n regionservers'. We can't assume applications
running HBase can always find a shared Zookeeper to use.
# Isolation. A Zookeeper-based HBase cluster can be affected by other
applications, which may slow it down or even bring it down by abusing or misusing
the shared Zookeeper that HBase relies on, while the raft-based approach doesn't
need to worry about this.
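To illustrate the first benefit above, here is a minimal, purely hypothetical
sketch (all names are assumptions, not HBase code) of an assign state machine
kept as a single source of truth: every master, active or standby, applies the
same committed transitions in the same order, so there is no second copy in
Zookeeper to keep consistent.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical raft-replicated assign state machine. The consensus library
// calls apply() on every master, in commit order, so each master's in-memory
// map converges to the same state without any external store.
public class ReplicatedAssignStateMachine {
  enum RegionState { OFFLINE, OPENING, OPEN, CLOSING, SPLITTING }

  // Rebuilt identically on every master by replaying the log (or a snapshot).
  private final Map<String, RegionState> regionStates = new ConcurrentHashMap<>();

  // Invoked for each committed log entry; the entry encodes "<region>:<state>".
  public void apply(byte[] committedEntry) {
    String[] parts = new String(committedEntry).split(":");
    regionStates.put(parts[0], RegionState.valueOf(parts[1]));
  }

  // The active master proposes a transition to the consensus library instead
  // of writing to Zookeeper; it takes effect only once a quorum commits it.
  public byte[] encodeTransition(String regionName, RegionState newState) {
    return (regionName + ":" + newState.name()).getBytes();
  }

  public RegionState stateOf(String regionName) {
    return regionStates.get(regionName);
  }
}
{code}

A standby master that has applied the same committed entries already holds the
full region-state map in memory, which is exactly why failover needs no rebuild
step.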
> Replace ZK with a consensus lib(paxos,zab or raft) running within master
> processes to provide better master failover performance and state consistency
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-10296
> URL: https://issues.apache.org/jira/browse/HBASE-10296
> Project: HBase
> Issue Type: Brainstorming
> Components: master, Region Assignment, regionserver
> Reporter: Feng Honghua
>
> Currently the master relies on ZK to elect the active master, monitor liveness
> and store almost all of its state, such as region states, table info,
> replication info and so on. ZK also acts as a channel for master-regionserver
> communication (such as in region assignment) and client-regionserver
> communication (such as replication state/behavior changes).
> But ZK as a communication channel is fragile due to its one-time watches and
> asynchronous notification mechanism, which together can lead to missed
> events (hence missed messages); for example, the master must rely on the state
> transition logic's idempotence to keep the region assignment state machine
> correct. In fact, almost all of the trickiest inconsistency issues can trace
> their root cause back to the fragility of ZK as a communication channel.
> Replacing ZK with paxos running within the master processes has the following benefits:
> 1. better master failover performance: all masters, whether active or standby,
> have the same latest state in memory (except lagging ones, which can eventually
> catch up later on). Whenever the active master dies, the newly elected active
> master can immediately play its role without failover work such as rebuilding
> its in-memory state by consulting the meta table and ZK.
> 2. better state consistency: the master's in-memory state is the only truth
> about the system, which eliminates inconsistency from the very beginning. And
> though the state is held by all masters, paxos guarantees the copies are
> identical at any time.
> 3. a more direct and simpler communication pattern: clients change state by
> sending requests to the master, and master and regionservers talk directly to
> each other with requests and responses...none of this needs to go through a
> third-party store like ZK, which can introduce more uncertainty, worse latency
> and more complexity.
> 4. ZK then only needs to be used for liveness monitoring, to determine whether
> a regionserver is dead, and later on we can eliminate ZK entirely once we build
> heartbeats between master and regionservers.
> I know this might look like a very crazy re-architecture, but it deserves deep
> thinking and serious discussion, right?
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)