zymap opened a new pull request, #4727:
URL: https://github.com/apache/bookkeeper/pull/4727

---

### Motivation

Currently, the bookie client quarantine mechanism primarily triggers based on `read` and `write` error responses from Bookies. However, in multi-region deployments, a common failure mode is a **Network Partition** or **DNS Resolution Failure** at the Region level. In such scenarios:

1. A Bookie remains registered in ZooKeeper (it can still heartbeat to its local ZK observer).
2. The Client (Broker) cannot resolve the Bookie's IP or establish a TCP connection.
3. The `EnsemblePlacementPolicy` (especially `RegionAwareEnsemblePlacementPolicy`) sees the Bookie as "Available" and repeatedly selects it to satisfy `minRack` or `E/Qw` constraints.
4. The `LedgerHandle` fails to write because it cannot initialize a connection handle, triggering an **Ensemble Change**.
5. Because the connection failure didn't trigger a quarantine, the placement policy picks the **same problematic Bookie** again in the next iteration.

This creates an **infinite Ensemble Change loop**, causing the Ledger write to hang indefinitely and bloating the Ledger metadata in ZooKeeper with thousands of segments.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
