@lindong28 From a design and code-readability perspective, I agree with what 
you have proposed (first atomically read `\controller_epoch` and create 
`\controller`, then update `\controller_epoch`). From an implementation 
perspective, however, ZooKeeper does not have a `read` Op, so we cannot perform 
a `read` operation as part of a `multi` (see 
http://people.apache.org/~larsgeorge/zookeeper-1215258/build/docs/dev-api/org/apache/zookeeper/Op.html).

Basically, we treat the moment a broker succeeds in creating the `\controller` 
znode as the "prepare" point of the controller election, and the moment it 
succeeds in incrementing the controller epoch as the "commit" point. For the 
"commit" to be correct, we need to ensure that `\controller_epoch` doesn't 
change between "prepare" and "commit". To achieve this, we can implement the 
logic using zk `multi` with the following steps:
1. Read `\controller_epoch` to get the current controller epoch **e1** with 
zkVersion **v1**
2. Create `\controller` only if the `\controller_epoch` zkVersion still matches 
**v1** (zk `multi` combining a version check with the create)
3. Update `\controller_epoch` to **e1+1** if its zkVersion still matches **v1** 
(zk conditional set)
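
To make the three steps concrete, here is a minimal sketch against a toy in-memory model of versioned znodes. It is not the Kafka implementation and it does not use a real ZooKeeper client: `FakeZk`, `multi_check_and_create`, and `try_elect` are hypothetical names invented for illustration, and the znode paths are written with forward slashes as ZooKeeper paths normally are.

```python
class BadVersion(Exception):
    """The expected zkVersion did not match the znode's current version."""


class NodeExists(Exception):
    """The znode to create already exists."""


class FakeZk:
    """Toy in-memory stand-in for ZooKeeper (hypothetical, illustration only)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = (data, 0)

    def get(self, path):
        return self.nodes[path]  # (data, version)

    def set(self, path, data, expected_version):
        # Conditional set: succeeds only if the version is unchanged.
        _, version = self.nodes[path]
        if version != expected_version:
            raise BadVersion(path)
        self.nodes[path] = (data, version + 1)

    def multi_check_and_create(self, check_path, expected_version,
                               create_path, data):
        # Models a multi of [check, create]: all-or-nothing.
        _, version = self.nodes[check_path]
        if version != expected_version:
            raise BadVersion(check_path)
        if create_path in self.nodes:
            raise NodeExists(create_path)
        self.nodes[create_path] = (data, 0)


def try_elect(zk, broker_id):
    """Return the new controller epoch, or None if the election was lost."""
    # Step 1: plain read of epoch e1 and zkVersion v1 (outside any multi).
    e1, v1 = zk.get("/controller_epoch")
    try:
        # Step 2 ("prepare"): create /controller only if the epoch is unchanged.
        zk.multi_check_and_create("/controller_epoch", v1,
                                  "/controller", broker_id)
        # Step 3 ("commit"): bump the epoch only if the version is still v1.
        zk.set("/controller_epoch", e1 + 1, v1)
    except (BadVersion, NodeExists):
        return None
    return e1 + 1
```

In this model a second broker calling `try_elect` after the first one has won gets `None`, because `/controller` already exists and the epoch's zkVersion has moved on. What a broker should do if step 3 fails after step 2 has succeeded is deliberately left out of the sketch.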

[ Full content available at: https://github.com/apache/kafka/pull/5101 ]