jiazhai commented on a change in pull request #1113: BP-28: Etcd as metadata 
URL: https://github.com/apache/bookkeeper/pull/1113#discussion_r165800469

 File path: site/bps/BP-28-etcd-as-metadata-store.md
 @@ -0,0 +1,102 @@
+title: "BP-28: use etcd as metadata store"
+issue: https://github.com/apache/bookkeeper/<issue-number>
+state: 'Under Discussion'
+release: "N/A"
+### Motivation
+Currently bookkeeper uses zookeeper as the metadata store. However, there are a 
couple of issues with the current approach, especially with using zookeeper.
+These issues include:
+1. You need to allocate special nodes for zookeeper. These nodes need to be 
treated specially, and have their own monitoring.
+   Ops need to understand both bookies and zookeeper.
+2. ZooKeeper is the scalability bottleneck. ZooKeeper doesn't scale writes as 
you add nodes. This means that if your bookkeeper
+   cluster reaches the maximum write throughput that ZK can sustain, you've 
reached the maximum capacity of your cluster, and there's nothing you
+   can do (except buy bigger hardware for your special nodes).
+3. ZooKeeper forces you into its programming model. In general, its 
programming model is not too bad. However, it becomes problematic when
+   the scale goes up (e.g. the number of clients and watchers increases). The 
issues usually come from _session expires_ and _watchers_.
+  - *Session Expires*: For simplicity, ZooKeeper ties session state directly 
to connection state. So when a connection is broken, the session usually 
expires (unless the client reconnects before the session expires), and once a 
session has expired, the underlying connection cannot be used anymore; the 
application has to close the connection and create a new client (a new 
connection). It is understandable that this makes zookeeper development easy. 
However, in reality, it means that if you cannot establish a session, you can't 
use this connection and you have to create new connections. Once your zookeeper 
cluster is in a bad state (e.g. network issue or jvm gc), the whole cluster is 
usually unable to recover because of the connection storm caused by session expires.
+  - *Watchers*: The zookeeper watcher is a one-time watcher, so applications 
can't reliably use it to get updates. In order to set a watcher, you have to 
read a znode or get its children. Imagine a use case where clients are watching 
a list of znodes (e.g. the list of bookies): when those clients' sessions expire, 
they have to read the list of znodes again in order to re-watch it, even if the 
list has never changed.
+  - The combination of session expires and watchers is often the root cause of 
critical zookeeper outages.
+This proposal is to explore other existing systems such as etcd as the 
metadata store. Using Etcd doesn't address concern #1; however, it might 
+address concerns #2 and #3 to some extent. And if you are running bookkeeper in 
k8s, there is already an Etcd instance available. It can become easier to run
+bookkeeper on k8s if we can use Etcd as the metadata store.
 Review comment:
   nit: this seems to have picked up some stray line breaks, perhaps from copy and paste?

