[ https://issues.apache.org/jira/browse/IGNITE-12780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111165#comment-17111165 ]
Anton Kalashnikov commented on IGNITE-12780:
--------------------------------------------

As far as I can see, the deadlock is not the cause; it is a consequence. When the execution path is correct, Ignite always initializes every cacheDataStore after logical recovery (right here: org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#restorePartitionStates). In this case, however, an unhandled exception was thrown, so Ignite skipped part of the recovery logic (which is totally wrong) and continued to start the node.

So the main problem is here: org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#finalizeCheckpointOnRecovery
{noformat}
catch (IgniteCheckedException e) {
    U.error(log, "Failed to write page to pageStore: " + res);

    writePagesError.compareAndSet(null, e);
}
{noformat}
We should catch all exceptions, not only IgniteCheckedException.

The second problem is IgniteSequentialNodeCrashRecoveryTest#testCrashOnCheckpointAfterLogicalRecovery, which counts dirty pages incorrectly (it forgets to move the free list to offheap before the calculation). That should be fixed as well.

[~v.pyatkov], if you have anything to add to my assumption, feel free to share.

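To illustrate why the narrow catch matters, here is a minimal, self-contained sketch. It is not the Ignite code: CheckedWriteException stands in for IgniteCheckedException, and PageWrite, narrow, broad and completes are hypothetical names introduced only for this example. It shows that a handler catching only the checked exception lets an unchecked one escape without recording it, while a handler catching Throwable records the first failure so the caller can inspect it later.

```java
import java.util.concurrent.atomic.AtomicReference;

public class WriteErrorCapture {
    /** Hypothetical stand-in for a checked exception such as IgniteCheckedException. */
    static class CheckedWriteException extends Exception {
        CheckedWriteException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical page-write task that may fail with a checked exception. */
    interface PageWrite {
        void run() throws CheckedWriteException;
    }

    /** Narrow handler: records only the checked exception; unchecked ones propagate. */
    static Runnable narrow(PageWrite write, AtomicReference<Throwable> err) {
        return () -> {
            try {
                write.run();
            }
            catch (CheckedWriteException e) {
                err.compareAndSet(null, e);
            }
        };
    }

    /** Broad handler: records the first failure of any kind. */
    static Runnable broad(PageWrite write, AtomicReference<Throwable> err) {
        return () -> {
            try {
                write.run();
            }
            catch (Throwable t) {
                err.compareAndSet(null, t);
            }
        };
    }

    /** Returns true if the task returned normally, false if a RuntimeException escaped. */
    static boolean completes(Runnable task) {
        try {
            task.run();

            return true;
        }
        catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A write that fails with an unchecked exception, which the narrow handler does not expect.
        PageWrite failing = () -> {
            throw new IllegalStateException("unexpected failure");
        };

        AtomicReference<Throwable> narrowErr = new AtomicReference<>();
        AtomicReference<Throwable> broadErr = new AtomicReference<>();

        System.out.println("narrow completed: " + completes(narrow(failing, narrowErr))
            + ", recorded: " + narrowErr.get());
        System.out.println("broad completed: " + completes(broad(failing, broadErr))
            + ", recorded: " + broadErr.get());
    }
}
```

Note that catching Throwable also captures Error subclasses; whether the real fix should rethrow those after recording them is a design decision this sketch does not address.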
> Deadlock between db-checkpoint-thread and checkpoint-runner
> -----------------------------------------------------------
>
>                 Key: IGNITE-12780
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12780
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Anton Kalashnikov
>            Priority: Critical
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.9
>
>
> Look at this run:
> https://ci.ignite.apache.org/buildConfiguration/IgniteTests24Java8_PdsIndexing/5121878?buildTab=log&focusLine=3
> {noformat}
> "db-checkpoint-thread-#46926%db.IgniteSequentialNodeCrashRecoveryTest0%" #55580 prio=5 os_prio=0 tid=0x00007efb2000c800 nid=0x77e waiting on condition [0x00007eff31add000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>         at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>         at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.fillCacheGroupState(GridCacheDatabaseSharedManager.java:4367)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:4147)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3728)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:3617)
>         at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at java.lang.Thread.run(Thread.java:748)
>
> "checkpoint-runner-#46927%db.IgniteSequentialNodeCrashRecoveryTest0%" #55581 prio=5 os_prio=0 tid=0x00007efbd4009000 nid=0x77f waiting on condition [0x00007eff317da000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000e5c23ed8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1645)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1688)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.fullSize(GridCacheOffheapManager.java:2061)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.lambda$fillCacheGroupState$1(GridCacheDatabaseSharedManager.java:4336)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer$$Lambda$565/253081186.run(Unknown Source)
>         at org.apache.ignite.internal.util.IgniteUtils.lambda$wrapIgniteFuture$3(IgniteUtils.java:11392)
>         at org.apache.ignite.internal.util.IgniteUtils$$Lambda$561/471384364.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I suspect this issue happening due to IgniteSequentialNodeCrashRecoveryTest

--
This message was sent by Atlassian Jira
(v8.3.4#803005)