karanmehta93 opened a new issue #2302: Autorecovery may hang indefinitely when 
zookeeper connection blips
URL: https://github.com/apache/bookkeeper/issues/2302
 
 
   **BUG REPORT**
   
   ***Describe the bug***
   
   In our clusters, @reddycharan and @jvrao caught this issue. In certain 
circumstances, all AR processes were running but not performing any 
replication, so ledgers remained under-replicated for longer than the 
threshold defined by our monitoring system. The root cause is that the latch 
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L606)
 is counted down only under certain conditions, listed 
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L587-L598).
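   The hang boils down to a watcher that releases its `CountDownLatch` only for a whitelist of event types. A minimal sketch of that pattern (class, enum, and method names here are illustrative, not the actual BookKeeper code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of the buggy watcher pattern: the latch is counted down
// only for a whitelist of events, so any event outside that list leaves the
// waiting AR thread blocked forever.
public class BuggyWatcherSketch {
    enum EventType { NodeDeleted, ConnectionLoss, SyncConnected }

    static void process(EventType event, CountDownLatch latch) {
        // BUG: only these events release the waiter.
        if (event == EventType.NodeDeleted || event == EventType.ConnectionLoss) {
            latch.countDown();
        }
        // A SyncConnected event (e.g. delivered after the session reconnects)
        // falls through without counting down, and the waiter hangs.
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        process(EventType.SyncConnected, latch);
        // await times out: the latch was never released.
        boolean released = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println("released=" + released); // prints released=false
    }
}
```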
 
   
   ***To Reproduce***
   
   Imagine a cluster in steady state. Since there is no ledger to 
replicate, the AR process is waiting on a zk watcher. Due to a n/w 
partition, the client gets disconnected, the AR process receives a 
`ConnectionLoss` event, and the latch is counted down. The retry mechanism 
kicks in and sets a new watch 
[here](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L599).
 When zk comes back up, the first event received is 
`SyncConnected`. The watcher processes that event without ever counting 
the latch down, so the AR process is stuck forever. 
   
   Note that this is just one scenario. The same hang can occur when an 
`AuthFailed` state is recovered from, or with other zk events that might be 
added in the future.
   
   Coming up with a test case is hard, since the exact race condition must 
be reproduced. However, I did verify the behavior with breakpoints in the 
IDE and can confirm it.
   
   ***Expected behavior***
   
   AR should resume processing on any event it receives from zookeeper.
   
   ***Fix***
   
   Count the latch down irrespective of the event type received, so the 
waiting thread always wakes up and re-evaluates the state itself.
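   A hypothetical sketch of the proposed fix (again with illustrative names, not the actual patch): release the latch on every event, and let the woken thread re-read the relevant znode to decide what to do next.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the fix: count the latch down on every watcher event,
// regardless of type. No event can then leave the AR thread blocked;
// the waiter re-checks state after waking, so spurious wakeups are safe.
public class FixedWatcherSketch {
    enum EventType { NodeDeleted, ConnectionLoss, SyncConnected, AuthFailed }

    static void process(EventType event, CountDownLatch latch) {
        // Fix: release the waiter unconditionally.
        latch.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        process(EventType.SyncConnected, latch); // previously left the waiter hung
        boolean released = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println("released=" + released); // prints released=true
    }
}
```

   Waking on every event is safe here because a spurious wakeup only costs one extra re-check, whereas a missed wakeup hangs AR indefinitely.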
   
   @eolivelli @sijie This can have serious impact. The only workaround is 
restarting all the hung processes, which is operationally challenging. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
