lhotari edited a comment on pull request #9764:
URL: https://github.com/apache/pulsar/pull/9764#issuecomment-787887121


   Here's an example of a suspicious stacktrace from one of the ReplicatorTest 
timeouts ([full 
threaddump](https://jstack.review/?https://gist.github.com/lhotari/3d2aa04041f86645622e044fbca18003#tda_1_dump)):
   ```
   "ForkJoinPool.commonPool-worker-2" daemon prio=5 tid=503 in Object.wait()
   java.lang.Thread.State: WAITING (on object monitor)
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.StampedLock.acquireRead(StampedLock.java:1286)
           at 
java.util.concurrent.locks.StampedLock.readLock(StampedLock.java:428)
           at 
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:367)
           at 
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:159)
           at 
org.apache.pulsar.broker.service.persistent.PersistentTopic.onPoliciesUpdate(PersistentTopic.java:2170)
           at 
org.apache.pulsar.broker.service.BrokerService.lambda$null$62(BrokerService.java:1674)
           at 
org.apache.pulsar.broker.service.BrokerService$$Lambda$922/1818690322.accept(Unknown
 Source)
           at java.util.Optional.ifPresent(Optional.java:159)
           at 
org.apache.pulsar.broker.service.BrokerService.lambda$null$63(BrokerService.java:1674)
           at 
org.apache.pulsar.broker.service.BrokerService$$Lambda$921/148576384.accept(Unknown
 Source)
           at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
           at 
java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)
           at 
java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)
           at 
org.apache.pulsar.broker.service.BrokerService.lambda$onUpdate$64(BrokerService.java:1669)
           at 
org.apache.pulsar.broker.service.BrokerService$$Lambda$920/1723403002.accept(Unknown
 Source)
           at 
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:387)
           at 
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:159)
           at 
org.apache.pulsar.broker.service.BrokerService.onUpdate(BrokerService.java:1665)
           at 
org.apache.pulsar.broker.service.BrokerService.onUpdate(BrokerService.java:175)
           at 
org.apache.pulsar.zookeeper.ZooKeeperDataCache.lambda$4(ZooKeeperDataCache.java:138)
           at 
org.apache.pulsar.zookeeper.ZooKeeperDataCache$$Lambda$538/1964134182.accept(Unknown
 Source)
           at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
           at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
           at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
           at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
           at 
org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$18(ZooKeeperCache.java:386)
           at 
org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$188/1103281887.accept(Unknown
 Source)
           at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
           at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
           at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
           at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
           at 
org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$14(ZooKeeperCache.java:366)
           at 
org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$198/880015051.run(Unknown 
Source)
           at 
java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
           at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
           at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
           at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
           at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
   ```
   
   whenever Policies get updated, this code will get executed:
   
https://github.com/apache/pulsar/blob/dbf29e95f7a4003bd905a7bd2ad13a5c46ae1a5e/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1659-L1682
   
   One can find the other possible partiticipants of the dead lock by searching 
for "ConcurrentOpenHashMap" in the [thread dump 
file](https://gist.github.com/lhotari/3d2aa04041f86645622e044fbca18003).
   
   This type of issues stopped occuring in ReplicatorTest after applying the 
changes in this PR.
   
   When looking at the existing code in 
[`ConcurrentOpenHashMap$Section.forEach`](
   
https://github.com/apache/pulsar/blob/34ca8938ca13ea60b05d164c27d9755855caf87c/pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/ConcurrentOpenHashMap.java#L356-L395),
 it seems that there wouldn't be a major difference in data consistency when 
using the snapshot approach. The forEach uses optimistic reads. If a rehashing 
happens, it will continue to iterate the table which was in place before the 
rehashing. The read lock gets acquired only in the case where some write 
happens during the read. Examples can be seen from ReplicatorTest failures 
where there are actual failures so there seems to be a clear need to reduce 
duration of locks when processing items held in ConcurrentOpenHashMaps. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to