@lindong28 From a design and code-readability perspective, I agree with what you have proposed (first atomically read `\controller_epoch` and create `\controller`, then update `\controller_epoch`). From an implementation perspective, however, ZooKeeper does not have a `read` Op, which means we cannot perform a `read` operation within a `multi` (see http://people.apache.org/~larsgeorge/zookeeper-1215258/build/docs/dev-api/org/apache/zookeeper/Op.html).
Basically, we treat the moment a broker succeeds in creating the `\controller` znode as the "prepare" point of the controller election, and the moment it succeeds in incrementing the controller epoch as the "commit" point. So for the "commit" to be correct, we need to ensure that `\controller_epoch` does not change between "prepare" and "commit". To achieve this, we can implement the logic with zk `multi` in the following steps:

1. Read `\controller_epoch` to get the current controller epoch **e1** with zkVersion **v1**.
2. Create `\controller` if the zkVersion of `\controller_epoch` matches **v1** (use zk `multi`).
3. Update `\controller_epoch` to **e1+1** if its zkVersion matches **v1** (zk conditional set).

[ Full content available at: https://github.com/apache/kafka/pull/5101 ]
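
To make the three steps concrete, here is a minimal sketch against a toy in-memory store rather than a real ZooKeeper client. Everything in it is an assumption for illustration: `FakeZk`, `multi_check_and_create`, `set_if_version`, and `try_elect` are hypothetical names, not Kafka or ZooKeeper APIs, and the znodes are spelled with forward slashes, as ZooKeeper paths are. The one property it tries to model is that the version check and the create in step 2 either both take effect or neither does.

```python
class BadVersion(Exception):
    """The znode's version did not match the expected version."""


class NodeExists(Exception):
    """The znode to be created already exists."""


class FakeZk:
    """Toy stand-in for ZooKeeper: maps path -> (data, version)."""

    def __init__(self):
        self.nodes = {}

    def get(self, path):
        # Returns (data, version), like a read that also returns the Stat.
        return self.nodes[path]

    def multi_check_and_create(self, check_path, expected_version,
                               create_path, data):
        # Models a multi of [check, create]: validate everything first,
        # then mutate, so a failure leaves the store unchanged.
        if self.nodes[check_path][1] != expected_version:
            raise BadVersion(check_path)
        if create_path in self.nodes:
            raise NodeExists(create_path)
        self.nodes[create_path] = (data, 0)

    def set_if_version(self, path, data, expected_version):
        # Models a conditional set: succeeds only on a version match
        # and bumps the version by one.
        if self.nodes[path][1] != expected_version:
            raise BadVersion(path)
        self.nodes[path] = (data, expected_version + 1)


def try_elect(zk, broker_id):
    """Return the new epoch if broker_id wins the election, else None."""
    # Step 1: read the current epoch e1 and its zkVersion v1.
    e1, v1 = zk.get("/controller_epoch")
    try:
        # Step 2 ("prepare"): create /controller only if the epoch znode
        # is still at version v1.
        zk.multi_check_and_create("/controller_epoch", v1,
                                  "/controller", broker_id)
        # Step 3 ("commit"): bump the epoch only if it is still at v1.
        zk.set_if_version("/controller_epoch", e1 + 1, v1)
    except (BadVersion, NodeExists):
        # Note: this toy does not clean up /controller if step 3 fails.
        return None
    return e1 + 1
```

With `/controller_epoch` seeded as `(5, 0)`, the first call to `try_elect(zk, 1)` returns `6`; a second broker calling `try_elect(zk, 2)` afterwards gets `None`, because `/controller` already exists.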
