karanmehta93 opened a new issue #2302: Autorecovery may hang indefinitely when zookeeper connection blips
URL: https://github.com/apache/bookkeeper/issues/2302

**BUG REPORT**

***Describe the bug***
In our clusters, @reddycharan and @jvrao caught this issue. In certain circumstances, all AR (Autorecovery) processes were running but not performing any replication. As a result, ledgers remained under-replicated for longer than the threshold defined by our monitoring system.

The reason is that the latch [here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L606) is counted down only under certain conditions, listed [here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L587-L598).

***To Reproduce***
Imagine that a cluster is in steady state. Since there is no ledger to replicate, the AR process is waiting for a ZooKeeper watcher to fire. A network partition cuts off the client, the AR process gets a `ConnectionLoss` event, and the wait on the latch ends. The retry mechanism kicks in and sets a local watch [here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L599). When ZooKeeper comes back up, the first event we receive is `SyncConnected`. The watcher processes that event without ever counting down the `CountDownLatch`, so the AR process stays stuck forever.

Note that this is just a single scenario. The same can happen when AuthFailures are recovered, or with other ZooKeeper events that might be added in the future.

Coming up with a test case is hard, since we need to hit the exact race condition. However, I verified the behavior with the help of breakpoints in the IDE and can confirm it.

***Expected behavior***
AR should start processing on any event it receives from ZooKeeper.

***Fix***
Count down the latch irrespective of the event received.

@eolivelli @sijie This can have a serious impact. The workaround is to restart all the hung processes, which can be operationally challenging.

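To make the failure mode and the proposed fix concrete, here is a minimal, self-contained sketch. It does not use the ZooKeeper client or reproduce the code in `ZkLedgerUnderreplicationManager`: the `ZkEvent` enum, the two watcher factories, and `wakesUp` are hypothetical names made up for illustration, and the set of events the selective watcher reacts to is an assumption. A timeout is used on `await` only so that the demo terminates; the hang described above corresponds to an untimed wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class LatchSketch {
    // Hypothetical stand-ins for ZooKeeper event kinds; not the real ZooKeeper API.
    enum ZkEvent { NODE_CHILDREN_CHANGED, DISCONNECTED, SYNC_CONNECTED }

    // Pattern described in the report: only some events release the waiter.
    // The exact set of events here is assumed for illustration.
    static Consumer<ZkEvent> selectiveWatcher(CountDownLatch latch) {
        return event -> {
            if (event == ZkEvent.NODE_CHILDREN_CHANGED || event == ZkEvent.DISCONNECTED) {
                latch.countDown();
            }
        };
    }

    // Proposed fix: every event releases the waiter.
    static Consumer<ZkEvent> unconditionalWatcher(CountDownLatch latch) {
        return event -> latch.countDown();
    }

    // Delivers one event to a fresh watcher and reports whether the
    // waiting side woke up before the timeout expired.
    static boolean wakesUp(boolean unconditional, ZkEvent firstEvent) {
        CountDownLatch latch = new CountDownLatch(1);
        Consumer<ZkEvent> watcher =
                unconditional ? unconditionalWatcher(latch) : selectiveWatcher(latch);
        watcher.accept(firstEvent);
        try {
            return latch.await(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // After a reconnect, the first event seen by the new watch is SYNC_CONNECTED.
        System.out.println("selective watcher woke up: "
                + wakesUp(false, ZkEvent.SYNC_CONNECTED));
        System.out.println("unconditional watcher woke up: "
                + wakesUp(true, ZkEvent.SYNC_CONNECTED));
    }
}
```

In this sketch the selective watcher consumes `SYNC_CONNECTED` without touching the latch, so the wait only ends because of the timeout, while the unconditional watcher releases the waiter immediately. Waking up on every event costs at most an extra pass through the replication loop, which seems a reasonable trade against a silent hang.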
