massakam opened a new pull request #8406:
URL: https://github.com/apache/pulsar/pull/8406


   ### Motivation
   
   The other day, some of our broker servers deadlocked while splitting namespace bundles. The broker's thread dump showed that some threads were blocked in `NamespaceService#getBundle()`.
   
   ```
   "ForkJoinPool.commonPool-worker-120" #547 daemon prio=5 os_prio=0 tid=0x00007efab4020800 nid=0x1318b waiting on condition [0x00007efa229e7000]
      java.lang.Thread.State: WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x00007f385c0dc720> (a java.util.concurrent.CompletableFuture$Signaller)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
           at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
           at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
           at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
           at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache$LoadingCacheView.get(LocalAsyncLoadingCache.java:400)
           at org.apache.pulsar.common.naming.NamespaceBundleFactory.getBundles(NamespaceBundleFactory.java:155)
           at org.apache.pulsar.broker.namespace.NamespaceService.getBundle(NamespaceService.java:177)
           at org.apache.pulsar.broker.namespace.NamespaceService.isTopicOwned(NamespaceService.java:849)
           at org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:813)
           at org.apache.pulsar.broker.service.BrokerService.checkTopicNsOwnership(BrokerService.java:1013)
           at org.apache.pulsar.broker.service.BrokerService.loadOrCreatePersistentTopic(BrokerService.java:625)
           at org.apache.pulsar.broker.service.BrokerService.lambda$getTopic$6(BrokerService.java:500)
           at org.apache.pulsar.broker.service.BrokerService$$Lambda$476/389775283.apply(Unknown Source)
           at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:274)
           at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:129)
           at org.apache.pulsar.broker.service.BrokerService.getTopic(BrokerService.java:499)
           at org.apache.pulsar.broker.service.BrokerService.getOrCreateTopic(BrokerService.java:483)
           at org.apache.pulsar.broker.service.ServerCnx.lambda$null$13(ServerCnx.java:681)
           at org.apache.pulsar.broker.service.ServerCnx$$Lambda$835/1815803313.apply(Unknown Source)
           at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
           at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
           at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
           at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
           at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
           at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
           at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
           at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
           at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
           at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:163)
   ```
   
   I think this is the same deadlock that https://github.com/apache/pulsar/pull/4190 was meant to fix. It seems that https://github.com/apache/pulsar/pull/4190 has since been reverted by https://github.com/apache/pulsar/pull/5919.
   
   ### Modifications
   
   The blocking method `getBundle()` should not be used in `NamespaceService#isTopicOwned()`. However, simply reverting https://github.com/apache/pulsar/pull/5919 would reintroduce the issue where clients cannot reconnect to topics in a split bundle.
   
   So, when the bundle metadata is not yet cached, `isTopicOwned()` returns false once and fetches the bundle metadata asynchronously so that it gets cached. The next time the client reconnects, the bundle metadata is already in the cache, so `isTopicOwned()` can return the correct result.
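
The idea above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: the class `NonBlockingOwnershipCheck`, its fields, and the `bundleLoader` function are hypothetical stand-ins for the broker's bundle cache and ownership lookup, and the real code works with Pulsar's own types rather than plain strings.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Pulsar code base.
class NonBlockingOwnershipCheck {
    // topic -> future that resolves to the name of the bundle containing the topic
    private final Map<String, CompletableFuture<String>> bundleCache = new ConcurrentHashMap<>();
    // Assumed asynchronous metadata lookup; it must return immediately.
    private final Function<String, CompletableFuture<String>> bundleLoader;
    // Bundles assumed to be owned by this broker.
    private final Set<String> ownedBundles;

    NonBlockingOwnershipCheck(Function<String, CompletableFuture<String>> bundleLoader,
                              Set<String> ownedBundles) {
        this.bundleLoader = bundleLoader;
        this.ownedBundles = ownedBundles;
    }

    // Never waits on the future: answers from cached metadata, otherwise
    // starts the load in the background and reports "not owned" for now.
    boolean isTopicOwned(String topic) {
        CompletableFuture<String> bundle = bundleCache.computeIfAbsent(topic, bundleLoader);
        if (!bundle.isDone()) {
            return false; // metadata still loading; a later call can succeed
        }
        if (bundle.isCompletedExceptionally()) {
            bundleCache.remove(topic, bundle); // drop the failed load so it is retried
            return false;
        }
        return ownedBundles.contains(bundle.join()); // already complete, so join() does not block
    }

    public static void main(String[] args) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        NonBlockingOwnershipCheck check =
                new NonBlockingOwnershipCheck(topic -> pending, Collections.singleton("bundle-0"));
        System.out.println(check.isTopicOwned("my-topic")); // false: metadata not loaded yet
        pending.complete("bundle-0");                       // simulate the async load finishing
        System.out.println(check.isTopicOwned("my-topic")); // true: answered from the cache
    }
}
```

The trade-off in this sketch is that the first ownership check after a split can report a false negative, which relies on the client retrying its connection, in exchange for never parking a shared pool thread on `CompletableFuture.get()`.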


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

