@merlimat @jvrao @reddycharan : I looked at the stack trace again. _The problem is indeed a race condition, and the fix here is okay_, *however the race condition is not inside `zookeeper` itself as described in the description; it is a race condition between bk and zk, and it happens very rarely*. The stack trace is very confusing and will make people think it is a `self-deadlock` situation (calling a sync method in an async callback).
Let me summarize my findings and hope they make things clearer for Charan's questions.

## zookeeper deadlock?

Based on Matteo's [initial investigation](https://github.com/apache/bookkeeper/issues/1578#issuecomment-410174320), there is a deadlock on the zookeeper thread.

```
"main-EventThread" #11 daemon prio=5 os_prio=0 tid=0x00007f05385d3aa0 nid=0x5bd2 waiting on condition [0x00007f05207f0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000e1374598> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.bookkeeper.client.SyncCallbackUtils.waitForResult(SyncCallbackUtils.java:45)
    at org.apache.bookkeeper.client.BookKeeperAdmin.openLedgerNoRecovery(BookKeeperAdmin.java:327)
    at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:645)
    at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:627)
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:510)
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:484)
    at org.apache.bookkeeper.util.ZkUtils$6$1.processResult(ZkUtils.java:285)
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$32$1.processResult(ZooKeeperClient.java:1395)
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$31$1.processResult(ZooKeeperClient.java:1356)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
```

The stack trace here is very confusing. As you can see, a synchronous `openLedgerNoRecovery` is being called from a zookeeper thread (`org.apache.zookeeper.ClientCnxn$EventThread`). One would assume that `openLedgerNoRecovery` calls zookeeper again, and since `openLedgerNoRecovery` is blocking the zookeeper thread while the zookeeper call inside `openLedgerNoRecovery` needs to run on that same thread, it looks like a self-deadlock.

*However*, as I said, the stack trace is very misleading: the self-deadlock can only happen if `openLedgerNoRecovery` is called from a callback of the *same* zookeeper client. *But* in Auditor, the zookeeper client used by `openLedgerNoRecovery` is different from the zookeeper client used for iterating ledgers:

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L613

For every checkAllLedgers, a brand new zookeeper client is created for opening ledgers, so the `self-deadlock` situation will not actually happen with the current code. (However, I would also highlight that #1588 would potentially cause a `self-deadlock`, since it attempts to remove this zookeeper client creation, so that PR needs to be revisited.)
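To make the confusing part concrete, here is a minimal sketch of that self-deadlock pattern. It uses plain Java executors, with a single-threaded pool standing in for the zookeeper event thread; it only illustrates the pattern and is not actual bookkeeper or zookeeper code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: a single callback thread issues a blocking call whose
// completion can only be delivered by that very thread. The zookeeper client has
// exactly one event thread (ClientCnxn$EventThread), so calling one of its sync
// methods from one of its own callbacks behaves like this.
public class SelfDeadlockSketch {
    public static void main(String[] args) {
        // stand-in for the single zookeeper event thread
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        eventThread.submit(() -> {
            // we are now "inside a callback" on the event thread; a sync API would
            // internally schedule its completion back onto the same event thread ...
            Future<?> completion =
                    eventThread.submit(() -> System.out.println("completion delivered"));
            try {
                // ... and then block waiting for it. The event thread is busy right
                // here, so the completion can never run: self-deadlock (hangs by design).
                completion.get();
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("never reached");
        });
    }
}
```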
## what is the real deadlock?

I looked again at the stack traces here: https://gist.github.com/hrsakai/d65e8e2cd511173232b1010a9bbdf126

1. There are a bunch of `BookKeeperClientWorker-OrderedExecutor` threads blocked waiting for `ZkLedgerUnderreplicationManager.markLedgerUnderreplicated` to complete. `markLedgerUnderreplicated` is a zookeeper operation: in bookkeeper's ordered executor, the Auditor is calling the zookeeper client to publish suspected ledgers (blocking calls), so the bookkeeper ordered executor is waiting for zookeeper's callback.

```
"BookKeeperClientWorker-OrderedExecutor-5-0" #53 prio=5 os_prio=0 tid=0x00007f0468002bd0 nid=0x3cb1 waiting on condition [0x00007f04771f0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000e137b988> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.apache.bookkeeper.util.ZkUtils.createFullPathOptimistic(ZkUtils.java:184)
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.markLedgerUnderreplicated(ZkLedgerUnderreplicationManager.java:269)
    at org.apache.bookkeeper.replication.Auditor.publishSuspectedLedgers(Auditor.java:550)
    at org.apache.bookkeeper.replication.Auditor.access$1400(Auditor.java:79)
    at org.apache.bookkeeper.replication.Auditor$ProcessLostFragmentsCb.operationComplete(Auditor.java:580)
    at org.apache.bookkeeper.replication.Auditor$ProcessLostFragmentsCb.operationComplete(Auditor.java:563)
    at org.apache.bookkeeper.client.LedgerChecker$FullLedgerCallback.operationComplete(LedgerChecker.java:303)
    at org.apache.bookkeeper.client.LedgerChecker$FullLedgerCallback.operationComplete(LedgerChecker.java:282)
    at org.apache.bookkeeper.client.LedgerChecker$LedgerFragmentCallback.operationComplete(LedgerChecker.java:130)
    at org.apache.bookkeeper.client.LedgerChecker$LedgerFragmentCallback.operationComplete(LedgerChecker.java:91)
    at org.apache.bookkeeper.client.LedgerChecker$ReadManyEntriesCallback.readEntryComplete(LedgerChecker.java:83)
    at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$1.readEntryComplete(PerChannelBookieClient.java:1559)
    at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.lambda$errorOut$0(PerChannelBookieClient.java:1575)
    at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$$Lambda$28/167038546.run(Unknown Source)
    at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue$1.safeRun(PerChannelBookieClient.java:1417)
    at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```

2. Go back to the stack trace of `openLedgerNoRecovery`. `openLedgerNoRecovery` is blocking the zookeeper thread, waiting for bookkeeper's callback to complete. However, bookkeeper's callbacks have to run on the ordered scheduler, and those threads are waiting for zookeeper callbacks to complete. **Hence the deadlock happens.**
```
"main-EventThread" #11 daemon prio=5 os_prio=0 tid=0x00007f05385d3aa0 nid=0x5bd2 waiting on condition [0x00007f05207f0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000e1374598> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.bookkeeper.client.SyncCallbackUtils.waitForResult(SyncCallbackUtils.java:45)
    at org.apache.bookkeeper.client.BookKeeperAdmin.openLedgerNoRecovery(BookKeeperAdmin.java:327)
    at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:645)
    at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:627)
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:510)
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:484)
    at org.apache.bookkeeper.util.ZkUtils$6$1.processResult(ZkUtils.java:285)
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$32$1.processResult(ZooKeeperClient.java:1395)
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$31$1.processResult(ZooKeeperClient.java:1356)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
```

**In summary, the real deadlock happens because:**

- we are calling sync zookeeper methods in bk's callback thread
- we are calling sync bookkeeper methods in zk's callback thread

**Both are waiting for each other.**

This happens rarely because bookkeeper has multiple callback threads. Most of the time, the bookkeeper callback thread that publishes a suspected ledger is different from the thread that completes `openLedgerNoRecovery`, since the ledger ids usually differ. It only happens when two ledger ids are assigned to the same bookkeeper callback thread and publishSuspectedLedgers and openLedgerNoRecovery run at the same time (a toy reproduction of exactly that interleaving is sketched below).
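To show how the two halves interlock, here is a toy reproduction using plain executors only, nothing bookkeeper- or zookeeper-specific: one single-threaded pool stands in for the bookkeeper ordered-executor thread, the other for the zookeeper event thread, and the latch simply forces the unlucky timing that production only hits occasionally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of the cross-wait described above, not BookKeeper code.
public class CrossWaitDeadlockSketch {

    public static void main(String[] args) {
        ExecutorService bkOrderedExecutor = Executors.newSingleThreadExecutor();
        ExecutorService zkEventThread = Executors.newSingleThreadExecutor();
        // force both threads to be busy before either issues its blocking call
        CountDownLatch bothBusy = new CountDownLatch(2);

        // "publishSuspectedLedgers": the bk callback thread makes a sync zk call,
        // i.e. it blocks until the zk event thread runs the completion.
        bkOrderedExecutor.submit(() -> {
            awaitBothBusy(bothBusy);
            Future<?> zkCompletion = zkEventThread.submit(() -> { /* zk callback */ });
            block(zkCompletion); // like CountDownLatch.await() in createFullPathOptimistic
        });

        // "openLedgerNoRecovery": the zk event thread makes a sync bk call,
        // i.e. it blocks until the bk ordered executor runs the completion.
        zkEventThread.submit(() -> {
            awaitBothBusy(bothBusy);
            Future<?> bkCompletion = bkOrderedExecutor.submit(() -> { /* bk callback */ });
            block(bkCompletion); // like CompletableFuture.get() in openLedgerNoRecovery
        });

        // Each single thread is now parked waiting on a task queued behind the other
        // parked thread, so neither completion can ever run: the process hangs by design.
    }

    private static void awaitBothBusy(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void block(Future<?> f) {
        try {
            f.get(); // parks the calling thread until the other executor runs the task
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```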
## What is the right fix?

The fix here is okay but not the correct one. The right fix is to move all synchronous calls out of the bk callback threads and the zk callback thread; that is the first immediate action we can take. Then we should consider rewriting those sync methods as async methods, to completely get rid of the sync logic.

-----

Above are all my findings regarding this issue. Below I will leave my thoughts on some of Charan's and JV's questions.

> Where do we define the number of ZK event threads?

There is only one send thread and one event thread per zookeeper client, and this is not configurable.

> If we have only one event thread, then it is a deadlock everytime right?

Yes, if we call the same zk client's sync method in its callback.

> So I believe the important fix needed here is to not wait forever on this latch

I don't think "not waiting forever" on this latch is the right fix. For sure you can time out and let the next check kick in. However, the problem here is that the deadlock leaves both the bookkeeper and zookeeper clients non-functional, so introducing a timeout only delays and hides the problem. In a production system, I would rather let it wait forever and use a metric (e.g. the number of auditor runs) for alerting: if the number of auditor runs doesn't increment for more than the check interval, the auditor is stuck and an alert should be triggered.

> Ideally I would move the checkers functionality to some other threadpool/executor, so that it wont impact the core functionality of Auditor,

Yeah, dedicated threadpools for different checkers are a good idea. However, I think the most critical fix is not calling sync methods in async callbacks; a sketch of what that could look like follows.
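A rough sketch of both steps is below; `onFragmentCheckComplete`, `publishSuspectedLedgerBlocking` and `publishSuspectedLedgerAsync` are hypothetical stand-ins I made up for illustration, not the actual Auditor or ledger-manager APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the two remediation steps described above.
public class AvoidBlockingInCallbacksSketch {

    // A dedicated pool for the auditor's blocking work, so that neither the
    // bookkeeper ordered executor nor the zookeeper event thread is ever parked.
    private static final ExecutorService auditorWorker = Executors.newSingleThreadExecutor();

    /** Step 1 (immediate): hop off the callback thread before calling anything synchronous. */
    static void onFragmentCheckComplete(long ledgerId) {
        // assumed to be running on a bk or zk callback thread
        auditorWorker.execute(() -> publishSuspectedLedgerBlocking(ledgerId));
        // the callback thread returns right away and stays free to deliver
        // the completions that the blocking call is waiting for
    }

    /** Step 2 (longer term): make the whole path asynchronous so nothing blocks anywhere. */
    static CompletableFuture<Void> publishSuspectedLedgerAsync(long ledgerId) {
        // the real change would chain the async zookeeper create call and complete
        // this future from the zk callback instead of awaiting a CountDownLatch
        return CompletableFuture.completedFuture(null); // placeholder
    }

    private static void publishSuspectedLedgerBlocking(long ledgerId) {
        // placeholder for the existing sync markLedgerUnderreplicated path
    }
}
```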
