lhotari commented on PR #17620: URL: https://github.com/apache/pulsar/pull/17620#issuecomment-1246303451
> @lhotari Do you have the stack trace for the deadlock to attach here for context? @merlimat I only had a heap dump of the problem so I'm not sure about the deadlock details. The heap dump includes a thread dump, but without lock info. Here's the full thread dump: https://gist.github.com/lhotari/b9513c1eeb7582b25f60f1c89119789d (line numbers refer to DS [fork of BK 4.14.5](https://github.com/datastax/bookkeeper/tree/ds-4.14.5.1.0.2) and [fork of Pulsar 2.10.x](https://github.com/datastax/pulsar/tree/ls210_1.5)) Here's a subset of threads related to metadata store and ZK. There might be several issues that are causing the issue. I noticed that LeaderElectionImpl.close is also blocked. I guess the ZK session has expired and the recovery for this doesn't work since it's blocked. Here are the metadata & ZK related threads [from the full thread dump](https://gist.github.com/lhotari/b9513c1eeb7582b25f60f1c89119789d): ``` AuditorBookie-bookie9.testcluster.example.com:3181 Waiting Thread ID: 85 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.getLostBookieRecoveryDelay(PulsarLedgerUnderreplicationManager.java:785) org.apache.bookkeeper.replication.Auditor$9.run(Auditor.java:545) org.apache.bookkeeper.util.SafeRunnable$1.safeRun(SafeRunnable.java:43) org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.util.concurrent.FutureTask.run(FutureTask.java:264) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.lang.Thread.run(Thread.java:834) AuditorElector-bookie9.testcluster.example.com:3181 Waiting Thread ID: 83 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.notifyLostBookieRecoveryDelayChanged(PulsarLedgerUnderreplicationManager.java:804) org.apache.bookkeeper.replication.Auditor.start(Auditor.java:699) org.apache.bookkeeper.replication.AuditorElector$3.run(AuditorElector.java:187) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.util.concurrent.FutureTask.run(FutureTask.java:264) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.lang.Thread.run(Thread.java:834) main-EventThread Blocked Thread ID: 59 org.apache.bookkeeper.replication.Auditor.submitAuditTask(Auditor.java:533) org.apache.bookkeeper.replication.Auditor.lambda$watchBookieChanges$0(Auditor.java:1051) org.apache.bookkeeper.replication.Auditor$$Lambda$340.onBookiesChanged() org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$216.accept() java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetChildrenResult(ZKMetadataStore.java:260) org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$6(ZKMetadataStore.java:191) org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$156.processResult() org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:490) org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:712) org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553) metadata-store-20-1 Waiting Thread ID: 57 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) metadata-store-3-1 Waiting Thread ID: 17 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) metadata-store-coordination-service-22-1 Waiting Thread ID: 86 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170) java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) metadata-store-coordination-service-5-1 Waiting Thread ID: 90 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.lambda$handleSessionEvent$3(LockManagerImpl.java:139) org.apache.pulsar.metadata.coordination.impl.LockManagerImpl$$Lambda$343.run() org.apache.bookkeeper.common.util.SafeRunnable$1.safeRun(SafeRunnable.java:60) org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.util.concurrent.FutureTask.run(FutureTask.java:264) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) metadata-store-zk-session-watcher-21-1 Waiting Thread ID: 60 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2043) org.apache.pulsar.metadata.coordination.impl.LeaderElectionImpl.close(LeaderElectionImpl.java:233) org.apache.pulsar.metadata.bookkeeper.PulsarLedgerAuditorManager.close(PulsarLedgerAuditorManager.java:96) org.apache.bookkeeper.replication.AuditorElector.shutdown(AuditorElector.java:234) org.apache.bookkeeper.replication.AutoRecoveryMain.shutdown(AutoRecoveryMain.java:154) org.apache.bookkeeper.replication.AutoRecoveryMain.lambda$new$0(AutoRecoveryMain.java:99) org.apache.bookkeeper.replication.AutoRecoveryMain$$Lambda$222.onSessionExpired() org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver.lambda$setSessionStateListener$0(PulsarMetadataClientDriver.java:68) org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver$$Lambda$223.accept() org.apache.pulsar.metadata.impl.AbstractMetadataStore.lambda$receivedSessionEvent$11(AbstractMetadataStore.java:297) org.apache.pulsar.metadata.impl.AbstractMetadataStore$$Lambda$342.accept() java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:807) org.apache.pulsar.metadata.impl.AbstractMetadataStore.receivedSessionEvent(AbstractMetadataStore.java:295) org.apache.pulsar.metadata.impl.ZKMetadataStore.receivedSessionEvent(ZKMetadataStore.java:152) org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$148.accept() org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:150) org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:110) org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$149.run() org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) metadata-store-zk-session-watcher-4-1 Waiting Thread ID: 20 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234) java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1798) java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1868) java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104) org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$149.run() org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) ReplicationWorker Waiting Thread ID: 63 java.lang.Object.wait(Object.java) org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager.getLedgerToRereplicate(PulsarLedgerUnderreplicationManager.java:572) org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:264) org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:834) ZKC-connect-executor-0 Waiting Thread ID: 169 jdk.internal.misc.Unsafe.park(Unsafe.java) java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081) java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.lang.Thread.run(Thread.java:834) ZKC-connect-executor-0-EventThread Blocked Thread ID: 823 org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120) org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:101) org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$361.accept() java.util.Optional.ifPresent(Optional.java:183) org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$1(ZKMetadataStore.java:101) org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$139.process() org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180) org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146) org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:578) org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553) ZKC-connect-executor-0-SendThread(zk01.testcluster.example.com:2181) Runnable Thread ID: 822 sun.nio.ch.EPoll.wait(EPoll.java) sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120) sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124) sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136) org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:332) org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
