[GitHub] [bookkeeper] karanmehta93 opened a new issue #2302: Autorecovery may hang indefinitely when zookeeper connection blips

GitBox Fri, 03 Apr 2020 18:40:24 -0700

karanmehta93 opened a new issue #2302: Autorecovery may hang indefinitely when
zookeeper connection blips
URL: https://github.com/apache/bookkeeper/issues/2302

**BUG REPORT**

***Describe the bug***

In our clusters, @reddycharan and @jvrao caught this issue. In certain
circumstances, all AR (Autorecovery) processes were running but none of them
was performing any replication, so ledgers remained under-replicated for longer
than the threshold defined by our monitoring system. The reason is that the latch
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L606)
is counted down only under certain conditions, listed
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L587-L598).

***To Reproduce***

Imagine that a cluster is in a steady state. Since there is no ledger to
replicate, the AR process is waiting on a ZooKeeper watcher. A network
partition cuts the client off, the AR process gets a `ConnectionLoss`
event, and the latch stops waiting. The retry mechanism kicks in and sets a
local watch
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L599).
When ZooKeeper comes back up, the first event we receive is
`SyncConnected`. The watcher processes that event without ever counting down
the latch, which causes the AR process to get stuck forever.
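
The scenario above can be modelled without ZooKeeper at all. The following is a minimal sketch, not the BookKeeper code: the `Event` enum and the `selectiveWatcher` and `waiterReleasedBy` helpers are hypothetical names, and the allow-list is an assumption that keeps only the two facts stated in this report (a disconnect releases the waiter, `SyncConnected` does not).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SelectiveLatchDemo {
    // Hypothetical stand-in for the ZooKeeper events discussed in this report.
    enum Event { NODE_CHILDREN_CHANGED, DISCONNECTED, SYNC_CONNECTED, AUTH_FAILED }

    // Models a watcher that counts the latch down only for an allow-list of events.
    // The exact allow-list here is an assumption made for illustration.
    static void selectiveWatcher(Event event, CountDownLatch latch) {
        switch (event) {
            case NODE_CHILDREN_CHANGED:
            case DISCONNECTED:
                latch.countDown();
                break;
            default:
                // Any event outside the allow-list leaves the latch untouched.
                break;
        }
    }

    // Delivers one event and reports whether a waiter on the latch is released
    // within timeoutMs. A bounded await is used so the demo itself cannot hang.
    static boolean waiterReleasedBy(Event event, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(1);
        selectiveWatcher(event, latch);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("DISCONNECTED releases waiter: "
                + waiterReleasedBy(Event.DISCONNECTED, 100));
        System.out.println("SYNC_CONNECTED releases waiter: "
                + waiterReleasedBy(Event.SYNC_CONNECTED, 100));
    }
}
```

In this model the `SYNC_CONNECTED` call only returns because the await is bounded. With an unbounded `await()`, the waiter would stay blocked until some allow-listed event arrived.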

Note that this is just one scenario. The same hang can happen when
AuthFailures are recovered, or on other ZooKeeper events that might be added
in the future.

Coming up with a test case is hard since we need to hit the exact race
condition. However, I did verify the behavior with the help of breakpoints in
the IDE.

***Expected behavior***

AR should start processing on any event it receives from ZooKeeper.

***Fix***

Decrement the countdown latch irrespective of the event received.
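
As a rough illustration of the proposed fix, here is a minimal sketch in which the watcher counts the latch down for every event. It is not the actual patch: the `Event` enum and the `wakeOnAnyEvent` and `waiterReleasedBy` helpers are hypothetical names introduced for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class UnconditionalLatchDemo {
    // Hypothetical stand-in for the ZooKeeper events discussed in this report.
    enum Event { NODE_CHILDREN_CHANGED, DISCONNECTED, SYNC_CONNECTED, AUTH_FAILED }

    // Proposed behaviour: the watcher releases the waiter for every event,
    // without inspecting the event type or connection state.
    static Consumer<Event> wakeOnAnyEvent(CountDownLatch latch) {
        return event -> latch.countDown();
    }

    // Delivers one event and reports whether a waiter on the latch is released
    // within timeoutMs.
    static boolean waiterReleasedBy(Event event, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(1);
        wakeOnAnyEvent(latch).accept(event);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        for (Event event : Event.values()) {
            System.out.println(event + " releases waiter: " + waiterReleasedBy(event, 100));
        }
    }
}
```

A spurious wake-up should be cheap under this approach, assuming the caller re-checks for under-replicated ledgers and re-registers its watch each time it wakes, as the retry behaviour described in this report suggests.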

@eolivelli @sijie This can have a serious impact. The workaround is to restart
all the hung processes, which can be operationally challenging.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
