lhotari edited a comment on pull request #9764: URL: https://github.com/apache/pulsar/pull/9764#issuecomment-787887121
Here's an example of a suspicious stacktrace from one of the ReplicatorTest timeouts ([full threaddump](https://jstack.review/?https://gist.github.com/lhotari/3d2aa04041f86645622e044fbca18003#tda_1_dump)): ``` "ForkJoinPool.commonPool-worker-2" daemon prio=5 tid=503 in Object.wait() java.lang.Thread.State: WAITING (on object monitor) at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.StampedLock.acquireRead(StampedLock.java:1286) at java.util.concurrent.locks.StampedLock.readLock(StampedLock.java:428) at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:367) at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:159) at org.apache.pulsar.broker.service.persistent.PersistentTopic.onPoliciesUpdate(PersistentTopic.java:2170) at org.apache.pulsar.broker.service.BrokerService.lambda$null$62(BrokerService.java:1674) at org.apache.pulsar.broker.service.BrokerService$$Lambda$922/1818690322.accept(Unknown Source) at java.util.Optional.ifPresent(Optional.java:159) at org.apache.pulsar.broker.service.BrokerService.lambda$null$63(BrokerService.java:1674) at org.apache.pulsar.broker.service.BrokerService$$Lambda$921/148576384.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) at org.apache.pulsar.broker.service.BrokerService.lambda$onUpdate$64(BrokerService.java:1669) at org.apache.pulsar.broker.service.BrokerService$$Lambda$920/1723403002.accept(Unknown Source) at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:387) at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:159) at org.apache.pulsar.broker.service.BrokerService.onUpdate(BrokerService.java:1665) at org.apache.pulsar.broker.service.BrokerService.onUpdate(BrokerService.java:175) at org.apache.pulsar.zookeeper.ZooKeeperDataCache.lambda$4(ZooKeeperDataCache.java:138) at org.apache.pulsar.zookeeper.ZooKeeperDataCache$$Lambda$538/1964134182.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$18(ZooKeeperCache.java:386) at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$188/1103281887.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$14(ZooKeeperCache.java:366) at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$198/880015051.run(Unknown Source) at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ``` whenever Policies get updated, this code will get executed: https://github.com/apache/pulsar/blob/dbf29e95f7a4003bd905a7bd2ad13a5c46ae1a5e/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1659-L1682 One can find the other possible partiticipants of the dead lock by searching for "ConcurrentOpenHashMap" in the [thread dump file](https://gist.github.com/lhotari/3d2aa04041f86645622e044fbca18003). This type of issues stopped occuring in ReplicatorTest after applying the changes in this PR. When looking at the existing code in [`ConcurrentOpenHashMap$Section.forEach`]( https://github.com/apache/pulsar/blob/34ca8938ca13ea60b05d164c27d9755855caf87c/pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/ConcurrentOpenHashMap.java#L356-L395), it seems that there wouldn't be a major difference in data consistency when using the snapshot approach. The forEach uses optimistic reads. If a rehashing happens, it will continue to iterate the table which was in place before the rehashing. The read lock gets acquired only in the case where some write happens during the read. Examples can be seen from ReplicatorTest failures where there are actual failures so there seems to be a clear need to reduce duration of locks when processing items held in ConcurrentOpenHashMaps. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
