[ 
https://issues.apache.org/jira/browse/IGNITE-12780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111165#comment-17111165
 ] 

Anton Kalashnikov commented on IGNITE-12780:
--------------------------------------------

As far as I can see, the deadlock is not the cause but a consequence. If the 
execution path is correct, Ignite always initializes every cacheDataStore after 
logical recovery (right here: 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#restorePartitionStates).
 But in this case an unhandled exception was thrown, so Ignite skipped part of 
the recovery logic (which is totally wrong) and continued starting the node.
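
To illustrate why a data store left uninitialized can end in the hang from the 
thread dump, here is a minimal sketch, not Ignite code. It assumes (this is my 
reading of the stack traces, not something confirmed in the source) that the 
checkpointer holds the checkpoint write lock while it waits for the runner, and 
that lazy initialization in the runner needs the read lock. The class and 
method names are hypothetical, and a timed tryLock is used instead of lock() so 
that the example terminates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CheckpointLockSketch {
    /**
     * Models a "checkpointer" that holds the write lock and waits for a task
     * on a "runner" thread. Returns true if the task could finish its work.
     */
    static boolean runnerCompletes(boolean storeInitialized) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        ExecutorService runner = Executors.newSingleThreadExecutor();

        lock.writeLock().lock(); // assumption: held during the checkpoint begin phase

        try {
            Future<Boolean> fut = runner.submit(() -> {
                if (storeInitialized)
                    return true; // nothing to initialize, no lock needed

                // Lazy initialization path: needs the read lock, which cannot be
                // granted while another thread holds the write lock.
                if (lock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    lock.readLock().unlock();

                    return true;
                }

                return false; // with a plain lock() this would block forever
            });

            return fut.get(); // the "checkpointer" waits while still holding the write lock
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            lock.writeLock().unlock();
            runner.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("initialized store: " + runnerCompletes(true));
        System.out.println("uninitialized store: " + runnerCompletes(false));
    }
}
```

With an already initialized store the runner never touches the lock, which is 
why the skipped recovery step matters here.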

So the main problem is here:
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#finalizeCheckpointOnRecovery
{noformat}
catch (IgniteCheckedException e) {
           U.error(log, "Failed to write page to pageStore: " + res);

           writePagesError.compareAndSet(null, e);
}
{noformat}
We should catch all exceptions, not only IgniteCheckedException.
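
A small standalone sketch of the difference, not the actual fix. 
CheckedWriteException and PageWrite are hypothetical stand-ins (for 
IgniteCheckedException and the page write task), and catching Throwable is 
just one possible way to read "all exceptions".

```java
import java.util.concurrent.atomic.AtomicReference;

public class WriteErrorCaptureSketch {
    /** Hypothetical stand-in for IgniteCheckedException. */
    static class CheckedWriteException extends Exception {
        CheckedWriteException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for a single page write task. */
    interface PageWrite {
        void run() throws CheckedWriteException;
    }

    /** Narrow catch, as in the quoted snippet: unchecked exceptions escape unrecorded. */
    static void runNarrow(PageWrite write, AtomicReference<Throwable> err) {
        try {
            write.run();
        }
        catch (CheckedWriteException e) {
            err.compareAndSet(null, e);
        }
    }

    /** Broad catch: any failure is recorded, so the caller can stop the recovery. */
    static void runBroad(PageWrite write, AtomicReference<Throwable> err) {
        try {
            write.run();
        }
        catch (Throwable t) {
            err.compareAndSet(null, t);
        }
    }

    public static void main(String[] args) {
        PageWrite failing = () -> {
            throw new IllegalStateException("unexpected");
        };

        AtomicReference<Throwable> narrowErr = new AtomicReference<>();

        try {
            runNarrow(failing, narrowErr);
        }
        catch (RuntimeException e) {
            // The error bypassed the handler, so nothing was recorded.
            System.out.println("narrow: escaped, recorded=" + narrowErr.get());
        }

        AtomicReference<Throwable> broadErr = new AtomicReference<>();

        runBroad(failing, broadErr);

        System.out.println("broad: recorded=" + broadErr.get());
    }
}
```

In the narrow variant the error reference stays null, so code that only checks 
that reference would treat the write as successful.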

The second problem is 
IgniteSequentialNodeCrashRecoveryTest#testCrashOnCheckpointAfterLogicalRecovery, 
which counts dirty pages incorrectly (it forgets to move the free list to 
offheap before the calculation). So it should be fixed as well.

[~v.pyatkov], if you have anything to add to my assumption, feel free to share.

> Deadlock between db-checkpoint-thread and checkpoint-runner
> -----------------------------------------------------------
>
>                 Key: IGNITE-12780
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12780
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Anton Kalashnikov
>            Priority: Critical
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.9
>
>
> Look at this run:
> https://ci.ignite.apache.org/buildConfiguration/IgniteTests24Java8_PdsIndexing/5121878?buildTab=log&focusLine=3
> {noformat}
> "db-checkpoint-thread-#46926%db.IgniteSequentialNodeCrashRecoveryTest0%" 
> #55580 prio=5 os_prio=0 tid=0x00007efb2000c800 nid=0x77e waiting on condition 
> [0x00007eff31add000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>         at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>         at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.fillCacheGroupState(GridCacheDatabaseSharedManager.java:4367)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:4147)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3728)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:3617)
>         at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at java.lang.Thread.run(Thread.java:748)
>               
>               
> "checkpoint-runner-#46927%db.IgniteSequentialNodeCrashRecoveryTest0%" #55581 
> prio=5 os_prio=0 tid=0x00007efbd4009000 nid=0x77f waiting on condition 
> [0x00007eff317da000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000e5c23ed8> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>         at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1645)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1688)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.fullSize(GridCacheOffheapManager.java:2061)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.lambda$fillCacheGroupState$1(GridCacheDatabaseSharedManager.java:4336)
>         at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer$$Lambda$565/253081186.run(Unknown
>  Source)
>         at 
> org.apache.ignite.internal.util.IgniteUtils.lambda$wrapIgniteFuture$3(IgniteUtils.java:11392)
>         at 
> org.apache.ignite.internal.util.IgniteUtils$$Lambda$561/471384364.run(Unknown 
> Source)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I suspect this issue is happening due to IgniteSequentialNodeCrashRecoveryTest



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
