@merlimat @jvrao @reddycharan : 

actually I looked at the stack trace again. _the problem is indeed a race 
condition, and the fix here is okay as well_, *however the race condition is not 
within `zookeeper` itself as described in the description; it is a race condition 
between bk and zk, and it happens very rarely*. the stack trace is very 
confusing and will make people think it is a `self-deadlock` situation 
(calling-sync-method-in-async-callback).

Let me summarize my findings and hope these will make things clearer for 
Charan's questions.

## zookeeper deadlock?

Based on matteo's [initial 
investigation](https://github.com/apache/bookkeeper/issues/1578#issuecomment-410174320),
 there is a deadlock on the zookeeper thread.

```
"main-EventThread" #11 daemon prio=5 os_prio=0 tid=0x00007f05385d3aa0 
nid=0x5bd2 waiting on condition [0x00007f05207f0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e1374598> (a 
java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
        at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at 
org.apache.bookkeeper.client.SyncCallbackUtils.waitForResult(SyncCallbackUtils.java:45)
        at 
org.apache.bookkeeper.client.BookKeeperAdmin.openLedgerNoRecovery(BookKeeperAdmin.java:327)
        at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:645)
        at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:627)
        at 
org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:510)
        at 
org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:484)
        at 
org.apache.bookkeeper.util.ZkUtils$6$1.processResult(ZkUtils.java:285)
        at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$32$1.processResult(ZooKeeperClient.java:1395)
        at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$31$1.processResult(ZooKeeperClient.java:1356)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
```

The stack trace here is very confusing. As you can see, a synchronous 
`openLedgerNoRecovery` is being called from a zookeeper thread 
(`org.apache.zookeeper.ClientCnxn$EventThread`). One would think 
`openLedgerNoRecovery` calls zookeeper again; since `openLedgerNoRecovery` is 
blocking the zookeeper thread and the zookeeper call inside 
`openLedgerNoRecovery` would need to complete on that same thread, it looks 
like a self-deadlock.
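
To make that apparent pattern concrete, here is a minimal, hypothetical sketch (not the Auditor code) of what such a self-deadlock would look like: a sync method of the *same* zookeeper client invoked from inside one of its own callbacks.

```java
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch, NOT the Auditor code: a ZooKeeper client has a single
// EventThread that dispatches all callbacks. Calling a synchronous method of the
// same client from inside one of its callbacks blocks that EventThread, so the
// response the sync call waits for can never be delivered -> self-deadlock.
public class SelfDeadlockSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        zk.getData("/path/a", false, (rc, path, ctx, data, stat) -> {
            try {
                // Runs on the client's EventThread; the sync getData below waits
                // for a completion that only this very thread could dispatch.
                zk.getData("/path/b", false, null);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, null);
    }
}
```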

*However*, as I said, the stack trace here is very misleading: the 
self-deadlock would only happen if `openLedgerNoRecovery` were called from a 
callback of the *same* zookeeper client. *But* in the Auditor, the zookeeper 
client used by `openLedgerNoRecovery` is different from the zookeeper client 
used for iterating ledgers.

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L613

for every checkAllLedgers run, a brand new zookeeper client is created for 
opening ledgers, so the `self-deadlock` situation will not actually happen with 
the current code. (However I would also highlight that #1588 would potentially 
cause a `self-deadlock`, since it attempts to remove that zookeeper client 
creation; that PR needs to be revisited.)

## what is the real deadlock

I looked again at the stack trace here. 
https://gist.github.com/hrsakai/d65e8e2cd511173232b1010a9bbdf126

1. There are a bunch of `BookKeeperClientWorker-OrderedExecutor` threads blocked 
waiting for `ZkLedgerUnderreplicationManager.markLedgerUnderreplicated` to 
complete. `markLedgerUnderreplicated` is a zookeeper operation: in bookkeeper's 
ordered executor, the Auditor is calling the zookeeper client to publish 
suspected ledgers (blocking calls). So the bookkeeper ordered executor threads 
are waiting for zookeeper's callback.

```
"BookKeeperClientWorker-OrderedExecutor-5-0" #53 prio=5 os_prio=0 
tid=0x00007f0468002bd0 nid=0x3cb1 waiting on condition [0x00007f04771f0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e137b988> (a 
java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at 
org.apache.bookkeeper.util.ZkUtils.createFullPathOptimistic(ZkUtils.java:184)
        at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.markLedgerUnderreplicated(ZkLedgerUnderreplicationManager.java:269)
        at 
org.apache.bookkeeper.replication.Auditor.publishSuspectedLedgers(Auditor.java:550)
        at 
org.apache.bookkeeper.replication.Auditor.access$1400(Auditor.java:79)
        at 
org.apache.bookkeeper.replication.Auditor$ProcessLostFragmentsCb.operationComplete(Auditor.java:580)
        at 
org.apache.bookkeeper.replication.Auditor$ProcessLostFragmentsCb.operationComplete(Auditor.java:563)
        at 
org.apache.bookkeeper.client.LedgerChecker$FullLedgerCallback.operationComplete(LedgerChecker.java:303)
        at 
org.apache.bookkeeper.client.LedgerChecker$FullLedgerCallback.operationComplete(LedgerChecker.java:282)
        at 
org.apache.bookkeeper.client.LedgerChecker$LedgerFragmentCallback.operationComplete(LedgerChecker.java:130)
        at 
org.apache.bookkeeper.client.LedgerChecker$LedgerFragmentCallback.operationComplete(LedgerChecker.java:91)
        at 
org.apache.bookkeeper.client.LedgerChecker$ReadManyEntriesCallback.readEntryComplete(LedgerChecker.java:83)
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$1.readEntryComplete(PerChannelBookieClient.java:1559)
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.lambda$errorOut$0(PerChannelBookieClient.java:1575)
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$$Lambda$28/167038546.run(Unknown
 Source)
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue$1.safeRun(PerChannelBookieClient.java:1417)
        at 
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
```

2. Go back to the stack trace of `openLedgerNoRecovery`. `openLedgerNoRecovery` 
is blocking the zookeeper thread and waiting for bookkeeper's callback to 
complete. However, bookkeeper's callback has to run on the ordered executor, 
whose threads are in turn waiting for zookeeper callbacks to complete. **Hence 
the deadlock happens**.

```
"main-EventThread" #11 daemon prio=5 os_prio=0 tid=0x00007f05385d3aa0 
nid=0x5bd2 waiting on condition [0x00007f05207f0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e1374598> (a 
java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
        at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at 
org.apache.bookkeeper.client.SyncCallbackUtils.waitForResult(SyncCallbackUtils.java:45)
        at 
org.apache.bookkeeper.client.BookKeeperAdmin.openLedgerNoRecovery(BookKeeperAdmin.java:327)
        at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:645)
        at org.apache.bookkeeper.replication.Auditor$6.process(Auditor.java:627)
        at 
org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:510)
        at 
org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.operationComplete(AbstractZkLedgerManager.java:484)
        at 
org.apache.bookkeeper.util.ZkUtils$6$1.processResult(ZkUtils.java:285)
        at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$32$1.processResult(ZooKeeperClient.java:1395)
        at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$31$1.processResult(ZooKeeperClient.java:1356)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:589)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)

```

**In summary, the real deadlock happens due to:**

- we are calling sync zookeeper methods in bk's callback thread
- we are calling sync bookkeeper methods in zk's callback thread

**Both are waiting for each other.**

This happens rarely because bookkeeper has multiple callback threads: most of 
the time, the bookkeeper callback thread that publishes a suspected ledger is 
different from the thread that would complete `openLedgerNoRecovery`, since the 
ledger ids usually differ. The deadlock only happens if two ledger ids are 
assigned to the same bookkeeper callback thread and `publishSuspectedLedgers` 
and `openLedgerNoRecovery` happen at the same time.
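
To see why that combination is fatal, here is a simplified, hypothetical sketch of the cross-wait using two plain single-thread executors standing in for the bk ordered-executor thread and the zk EventThread (none of the names below are real bookkeeper/zookeeper code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the cross-wait: "bkThread" stands in for a
// BookKeeperClientWorker-OrderedExecutor thread, "zkThread" for the zookeeper
// EventThread. Each blocks on a result that only the other single-threaded
// executor can produce, so both park forever.
public class CrossDeadlockSketch {
    public static void main(String[] args) {
        ExecutorService bkThread = Executors.newSingleThreadExecutor();
        ExecutorService zkThread = Executors.newSingleThreadExecutor();

        CompletableFuture<Void> zkOp = new CompletableFuture<>(); // completed only on zkThread
        CompletableFuture<Void> bkOp = new CompletableFuture<>(); // completed only on bkThread

        // Like publishSuspectedLedgers: a bk callback thread blocks on a sync zk call.
        bkThread.submit(() -> zkOp.join());
        // Like openLedgerNoRecovery: the zk event thread blocks on a sync bk call.
        zkThread.submit(() -> bkOp.join());

        // These completions are queued behind the blocked tasks and never run.
        zkThread.submit(() -> zkOp.complete(null));
        bkThread.submit(() -> bkOp.complete(null));
    }
}
```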

## What is the right fix?

The fix here is okay but not the correct one. The right fix is to move all 
synchronous calls out of the bk callback threads and the zk callback thread; 
that's the first immediate action we can take. Then consider rewriting those 
sync methods on top of the async methods to get rid of the sync logic completely.
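
A hedged sketch of that first step (all names below are made up for illustration, this is not the actual patch): the callback threads only *schedule* the blocking work on a dedicated executor instead of running it inline.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; the method names are hypothetical, not the real fix.
// The idea: neither the bk ordered-executor thread nor the zk EventThread ever
// blocks; any synchronous work is handed off to a dedicated executor.
public class OffloadSketch {
    // Dedicated pool for the Auditor's blocking work.
    private final ExecutorService auditorWorker = Executors.newSingleThreadExecutor();

    // Invoked from a bookkeeper read callback (bk ordered-executor thread).
    void onLostFragments(long ledgerId) {
        // Do NOT call the sync zookeeper method here; schedule it so the bk
        // callback thread returns immediately.
        auditorWorker.submit(() -> publishSuspectedLedgersSync(ledgerId));
    }

    // Invoked from a zookeeper callback (zk EventThread).
    void onLedgerFound(long ledgerId) {
        // Same idea for the blocking openLedgerNoRecovery-style call.
        auditorWorker.submit(() -> checkLedgerSync(ledgerId));
    }

    void publishSuspectedLedgersSync(long ledgerId) { /* blocking zk call */ }

    void checkLedgerSync(long ledgerId) { /* blocking bk call */ }
}
```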

----- 

Above is all my findings regarding this issue. Below I will leave my thoughts 
to some of Charan and JV's questions.

> Where do we define the number of ZK event threads?

there is only one send thread and one event thread per zookeeper client; this 
is not configurable.

> If we have only one event thread, then it is a deadlock everytime right?

yes, if we call the same zk client's sync method in its own callback.

> So I believe the important fix needed here is to not wait forever on this 
> latch 

I don't think "not waiting forever" on this latch is the right fix. Sure, you 
can time out and let the next check kick in. However, the problem here is that 
the deadlock leaves both the bookkeeper and zookeeper clients non-functional, 
so introducing a timeout just delays and hides the problem. In a production 
system, I would rather let it wait forever and use a metric (e.g. the number of 
auditor runs) for alerting: for example, if the number of auditor runs doesn't 
increment for more than the check interval, the auditor is stuck and an alert 
should be triggered.
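
A hypothetical sketch of such a liveness check (the counter, interval, and alert hook are all made up, just to illustrate the idea):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the alerting idea: a counter is bumped on every auditor
// run, and a watchdog raises an alert if it hasn't moved within one check interval.
public class AuditorLivenessCheck {
    private final AtomicLong auditorRuns = new AtomicLong(); // incremented after each checkAllLedgers
    private long lastSeen = 0;

    // Invoked periodically (roughly once per check interval) by a monitoring thread.
    void watchdog(long checkIntervalSeconds) {
        long current = auditorRuns.get();
        if (current == lastSeen) {
            // No auditor run completed within the interval -> the auditor is likely stuck.
            alert("auditor made no progress in " + checkIntervalSeconds + "s");
        }
        lastSeen = current;
    }

    void alert(String msg) { /* page the operator */ }
}
```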

>  Ideally I would move the checkers functionality to some other 
> threadpool/executor, so that it wont impact the core functionality of 
> Auditor, 

yeah, dedicated threadpools for the different checkers is a good idea. However 
I think the most critical fix is to stop calling sync methods in async callbacks.


