Hi,

During leader switches, we observe connection imbalance among our observers, leading to some observers becoming overloaded with a large number of connections, which disrupts our capacity estimates.

Further investigation revealed that during leader election in Apache ZooKeeper, two types of threads run concurrently on the leader node:

- The QuorumPeer thread, which manages the quorum protocol.
- The LearnerHandler threads, which handle synchronization of the leader with learners.

Because these threads run concurrently, during the leader's bootstrap process some observers may finish synchronizing with the leader before the leader transitions to the broadcast state, while others synchronize only after the leader has entered the broadcast state.

For example, let's say 14, 65, 119, 82 and 110 are the server IDs of five learners.

1. Initially, 14, 65, 119, 82 and 110 are all in sync with the leader (peerLastZxid=0x2b300005cf4).
2. A leader switch happens.
3. The new leader sends an empty DIFF to 14, 65 and 119 (since lastProcessedZxid == peerLastZxid).
4. 14, 65 and 119 start accepting connections.
5. The leader's epoch increases, the leader transitions to the broadcast state, and maxCommittedLog changes from 0x2b300005cf4 to 0x2b400000032.
6. The leader sends a committedLog DIFF to 82 and 110, since their zxid (peerLastZxid) is still 0x2b300005cf4.
7. 82 and 110 transition to broadcast and start accepting connections much later than 14, 65 and 119, leading to connection imbalance.

Logs:

2024-06-25 06:53:17,508 [myid:] - INFO [LearnerHandler-/10.155.16.87:57098:?@?] - Synchronizing with Learner sid: 14 maxCommittedLog=0x2b300005cf4 minCommittedLog=0x2b300005b00 lastProcessedZxid=0x2b300005cf4 peerLastZxid=0x2b300005cf4
2024-06-25 06:53:17,508 [myid:] - INFO [LearnerHandler-/10.155.23.246:44712:?@?] - Synchronizing with Learner sid: 65 maxCommittedLog=0x2b300005cf4 minCommittedLog=0x2b300005b00 lastProcessedZxid=0x2b300005cf4 peerLastZxid=0x2b300005cf4
2024-06-25 06:53:17,508 [myid:] - INFO [LearnerHandler-/10.155.178.245:49152:?@?] - Synchronizing with Learner sid: 119 maxCommittedLog=0x2b300005cf4 minCommittedLog=0x2b300005b00 lastProcessedZxid=0x2b300005cf4 peerLastZxid=0x2b300005cf4
------------------------------
2024-06-25 06:53:18,467 [myid:] - INFO [QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:12913)(secure=[0:0:0:0:0:0:0:0]:12912):?@?] - Peer state changed: leading - broadcast
------------------------------
2024-06-25 06:53:23,071 [myid:] - INFO [LearnerHandler-/10.199.145.252:59972:?@?] - On disk txn sync enabled with snapshotSizeFactor 0.33
2024-06-25 06:53:23,071 [myid:] - INFO [LearnerHandler-/10.199.145.252:59972:?@?] - Synchronizing with Learner sid: 110 maxCommittedLog=0x2b400000032 minCommittedLog=0x2b300005b32 lastProcessedZxid=0x2b400000032 peerLastZxid=0x2b300005cf4
2024-06-25 06:53:23,071 [myid:] - INFO [LearnerHandler-/10.199.145.252:59972:?@?] - Using committedLog for peer sid: 110
2024-06-25 06:53:23,072 [myid:] - INFO [LearnerHandler-/10.155.180.220:46648:?@?] - On disk txn sync enabled with snapshotSizeFactor 0.33
2024-06-25 06:53:23,072 [myid:] - INFO [LearnerHandler-/10.155.180.220:46648:?@?] - Synchronizing with Learner sid: 82 maxCommittedLog=0x2b400000032 minCommittedLog=0x2b300005b32 lastProcessedZxid=0x2b400000032 peerLastZxid=0x2b300005cf4
2024-06-25 06:53:23,072 [myid:] - INFO [LearnerHandler-/10.155.180.220:46648:?@?] - Using committedLog for peer sid: 82

Questions:

1. Why does the new leader start synchronization (via an empty DIFF) with some observers (14, 65, 119) before others (82, 110)?
2. Could synchronization with all learners start either entirely before or entirely after the leader increments its epoch and transitions to the broadcast state? Why does the current behavior not work this way?

Regards,
Abhilash
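
P.S. To make the example concrete, here is a minimal Python sketch of how we understand the leader's choice between an empty DIFF and a committedLog DIFF, replayed with the values from the logs above. It is a simplified reading of the observed behavior, not ZooKeeper's actual code: the function names are hypothetical, and the TRUNC, on-disk txn log and SNAP paths are collapsed into a single fallback. The only assumption about the zxid format is the usual layout, with the epoch in the high 32 bits and the counter in the low 32 bits.

```python
def epoch_of(zxid: int) -> int:
    """High 32 bits of a zxid: the leader epoch."""
    return zxid >> 32


def counter_of(zxid: int) -> int:
    """Low 32 bits of a zxid: the per-epoch transaction counter."""
    return zxid & 0xFFFFFFFF


def choose_sync(peer_last_zxid: int, last_processed_zxid: int,
                min_committed_log: int, max_committed_log: int) -> str:
    """Hypothetical, simplified model of the leader's sync decision."""
    if peer_last_zxid == last_processed_zxid:
        # Learner is already caught up: nothing to replay.
        return "empty DIFF"
    if min_committed_log <= peer_last_zxid <= max_committed_log:
        # Learner is behind, but within the in-memory committed log.
        return "committedLog DIFF"
    # All remaining cases are lumped together in this sketch.
    return "other (TRUNC, on-disk txn log, or SNAP)"


# sid 14 at 06:53:17, before the leader enters the broadcast state.
early = choose_sync(0x2b300005cf4, 0x2b300005cf4,
                    0x2b300005b00, 0x2b300005cf4)

# sid 110 at 06:53:23, after the epoch moved from 0x2b3 to 0x2b4.
late = choose_sync(0x2b300005cf4, 0x2b400000032,
                   0x2b300005b32, 0x2b400000032)

print(early)
print(late)
print(hex(epoch_of(0x2b400000032)), hex(counter_of(0x2b400000032)))
```

The learner's zxid is identical in both calls; only the leader's own state differs, which is why the outcome depends on when each LearnerHandler runs relative to the transition to broadcast.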