Hi ZK Devs,

At Midokura we use ZK to store/manage/propagate some replicated sets and
are running into (expected) scalability and performance limits.

My understanding is that ZK's model of notifying "object X changed" (as
opposed to "change Y was applied to object X"), and hence forcing
getChildren watchers to reload the entire child set, is motivated by
simplicity and by avoiding per-client state - the ZK server would otherwise
have to buffer changes for temporarily disconnected clients.

I think the argument is sound for non-child data, but providing child
diffs on a best-effort basis should be both easy for the ZK server
and in line with many use cases/recipes. We're about to investigate the
feasibility of such a design and tackle it ourselves, but I wanted to reach
out to the community to ask whether someone else has thought about this,
whether there's some fundamental reason not to implement it, and for any
advice if we attempt it.

Specifically, a getChildren watcher would receive two kinds of
notifications:

   1. Simple "updated" without details, already provided today.
   2. A new notification that passes a 2-tuple (Set<String> added,
   Set<String> removed) - the Strings are individual child names under the
   watched path (e.g. "proc123" under "/zk/mylocks/"); a rough sketch of
   such an event follows below.
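
To make the second kind concrete, here is a rough Java sketch of what such
an event could carry. All of these names (ChildrenDiffEvent, etc.) are
invented for illustration and are not part of ZooKeeper's current API:

import java.util.Collections;
import java.util.Set;

// Hypothetical event shape, for discussion only - not existing ZK API.
public final class ChildrenDiffEvent {
    private final String path;          // the watched parent, e.g. "/zk/mylocks"
    private final Set<String> added;    // child names created since the last notification
    private final Set<String> removed;  // child names deleted since the last notification

    public ChildrenDiffEvent(String path, Set<String> added, Set<String> removed) {
        this.path = path;
        this.added = Collections.unmodifiableSet(added);
        this.removed = Collections.unmodifiableSet(removed);
    }

    public String getPath() { return path; }
    public Set<String> getAdded() { return added; }
    public Set<String> getRemoved() { return removed; }
}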

Upon receiving and applying an update/transaction for a child set from the
leader, a ZK server can easily compute the diff and send it to all
healthy/connected clients that are watching the parent (a type-2
notification). Since the diffs are not buffered, any client that reconnects
(before its session expires) will simply be told that the child set changed
(a type-1 notification). That's why these are "best-effort diffs for child
sets".

In the great majority of cases, clients will be kept up to date by the
diffs and will only occasionally need to re-read the entire child list,
thus reducing the frequency of stampedes on the watched parent and making
recipe-writing easier.
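
On the client side, a recipe could keep a local mirror of the child set,
apply type-2 diffs directly, and fall back to a full getChildren() only on
a type-1 notification. A minimal sketch, assuming the hypothetical diff
event above (getChildren() itself is the existing API):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch of a client-side mirror of a parent's child set.
final class ChildSetMirror {
    private final ZooKeeper zk;
    private final String parent;
    private final Set<String> children = new HashSet<>();

    ChildSetMirror(ZooKeeper zk, String parent) {
        this.zk = zk;
        this.parent = parent;
    }

    // Type-2 notification: apply the diff locally, no server round trip.
    synchronized void onDiff(Set<String> added, Set<String> removed) {
        children.addAll(added);
        children.removeAll(removed);
    }

    // Type-1 notification (e.g. after a reconnect): full re-read,
    // re-registering the plain children watch.
    synchronized void onChildrenChanged()
            throws KeeperException, InterruptedException {
        List<String> fresh = zk.getChildren(parent, true);
        children.clear();
        children.addAll(fresh);
    }

    synchronized Set<String> snapshot() {
        return new HashSet<>(children);
    }
}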

I look forward to hearing your thoughts and will try to also get back to
you with implementation-specific details.

best,
Pino
