[ 
https://issues.apache.org/jira/browse/IGNITE-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432094#comment-16432094
 ] 

Alexey Goncharuk commented on IGNITE-7871:
------------------------------------------

I also found this deadlock in TC tests:
{code}

Found one Java-level deadlock:
=============================
"sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%":
  waiting to lock monitor 0x00007f58a019a7c8 (object 0x00000000e3e33370, a org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch),
  which is held by "exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%"
"exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%":
  waiting for ownable synchronizer 0x00000000de084358, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%"

Java stack information for the threads listed above:
===================================================
"sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%":
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.newCoordinator(ExchangeLatchManager.java:565)
        - waiting to lock <0x00000000e3e33370> (a org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.access$300(ExchangeLatchManager.java:521)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.processNodeLeft(ExchangeLatchManager.java:373)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.lambda$null$1(ExchangeLatchManager.java:115)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$$Lambda$36/1235895228.run(Unknown Source)
        at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6746)
        at org.apache.ignite.internal.processors.closure.GridClosureProcessor$1.body(GridClosureProcessor.java:827)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
"exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000de084358> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.processAck(ExchangeLatchManager.java:268)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.lambda$new$0(ExchangeLatchManager.java:101)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$$Lambda$1/832828638.onMessage(Unknown Source)
        at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
        at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
        at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1632)
        at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1715)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.sendAck(ExchangeLatchManager.java:578)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.countDown(ExchangeLatchManager.java:596)
        - locked <0x00000000e3e33370> (a org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.waitPartitionRelease(GridDhtPartitionsExchangeFuture.java:1322)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1111)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:712)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2401)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2290)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:745)
{code}
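
The dump above looks like a lock-order inversion: the exchange worker takes the ClientLatch monitor in countDown() and then needs the ReentrantLock in processAck(), while the sys thread appears to hold that ReentrantLock in processNodeLeft() and then needs the ClientLatch monitor in newCoordinator(). Below is a minimal standalone sketch of that pattern, not Ignite code. The class name, the variables latchMonitor and managerLock, and the thread names are hypothetical stand-ins, and the JVM is assumed to support ownable-synchronizer monitoring (HotSpot does).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionDemo {
    /**
     * Starts two daemon threads that take the same two locks in opposite order
     * and returns how many deadlocked threads the JVM reports (0 if none seen).
     */
    public static int reproduce() throws InterruptedException {
        // Hypothetical stand-in for the ClientLatch monitor.
        final Object latchMonitor = new Object();
        // Hypothetical stand-in for the manager-level ReentrantLock.
        final ReentrantLock managerLock = new ReentrantLock();
        // Ensures each thread holds its first lock before trying the second.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread exchangeWorker = new Thread(() -> {
            synchronized (latchMonitor) {          // like countDown()
                awaitBoth(bothHoldFirstLock);
                managerLock.lock();                // like processAck(): blocks forever
                managerLock.unlock();
            }
        }, "exchange-worker-demo");

        Thread sysThread = new Thread(() -> {
            managerLock.lock();                    // like processNodeLeft()
            try {
                awaitBoth(bothHoldFirstLock);
                synchronized (latchMonitor) {      // like newCoordinator(): blocks forever
                    // never reached
                }
            } finally {
                managerLock.unlock();
            }
        }, "sys-demo");

        // Daemon threads so the stuck pair does not keep the JVM alive.
        exchangeWorker.setDaemon(true);
        sysThread.setDaemon(true);
        exchangeWorker.start();
        sysThread.start();

        // Poll the JVM deadlock detector, which covers both monitors and
        // ownable synchronizers such as ReentrantLock.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads();
            if (ids != null)
                return ids.length;
            Thread.sleep(50);
        }
        return 0;
    }

    private static void awaitBoth(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + reproduce());
    }
}
```

The sketch only reproduces the cycle; it does not suggest how the fix should look in ExchangeLatchManager.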

> Implement 2-phase waiting for partition release
> -----------------------------------------------
>
>                 Key: IGNITE-7871
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7871
>             Project: Ignite
>          Issue Type: Improvement
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Alexey Goncharuk
>            Priority: Major
>             Fix For: 2.5
>
>
> Using the validation implemented in IGNITE-7467, we can observe the following 
> situation.
> Suppose we have a partition owned by nodes N1 (primary) and N2 
> (backup):
> 1) Exchange is started.
> 2) N2 finishes waiting for partition release and starts to create its Single 
> message (with update counters).
> 3) N1 is still waiting for partition release.
> 4) There is a pending cache update N1 -> N2. This update is performed after step 2.
> 5) The update increments the update counters on both N1 and N2.
> 6) N1 finishes waiting for partition release, while N2 has already sent its Single 
> message to the coordinator with an outdated update counter.
> 7) The coordinator sees different partition update counters for N1 and N2. 
> Validation fails even though the data is identical.
> Solution:
> Every server node participating in PME should wait until all other server 
> nodes have finished their ongoing updates (i.e. have completed the wait for 
> partition release method).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
