[ 
https://issues.apache.org/jira/browse/IGNITE-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Chugunov updated IGNITE-14093:
-------------------------------------
    Reviewer: Sergey Chugunov

> ttl-cleanup-worker fails with AssertionError and leads to 
> CorruptedTreeException
> ---------------------------------------------------------------------------------
>
>                 Key: IGNITE-14093
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14093
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Mirza Aliev
>            Assignee: Mirza Aliev
>            Priority: Major
>             Fix For: 2.11
>
>         Attachments: IgnitePdsWithTtlDeferredDeleteOnRestartTest (1).java
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue is very rare: it is quite hard to reproduce on macOS, though some 
> Windows users have reproduced it somewhat more often.
> Scenario:
>  # 2 baseline nodes, cache with expiry policy = 60 sec. 
>  # Put some entries in the cache, stop one node immediately.
>  # Remove node from baseline.
>  # Wait until expiration.
>  # Start the stopped node: an NPE is thrown on node start.
> {code:java}
> [2020-05-08 16:07:17,925][ERROR][ttl-cleanup-worker-#43][root] Critical 
> system error detected. Will be handled accordingly to configured handler 
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, 
> err=java.lang.NullPointerException]]
> java.lang.NullPointerException
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2765)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
>       at 
> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> In some cases, the following stack trace is produced instead:
> {code:java}
> [2020-05-25 
> 10:49:29,677][ERROR][ttl-cleanup-worker-#242%db.IgnitePdsWithTtlDeferredDeleteOnRestartTest2%][IgniteTestResources]
>  Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
> [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], 
> groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], 
> upper=PendingRow []]]]]
> class 
> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple 
> [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: 
> [lower=PendingRow [], upper=PendingRow []]]
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6110)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1119)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1083)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1078)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2742)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
>       at 
> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>       at 
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.AssertionError: FullPageId [pageId=0001000100000007, 
> effectivePageId=0000000100000007, grpId=-1237460590]
>       at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:822)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:696)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:685)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:156)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.acquirePage(BPlusTree.java:6041)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1420)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1397)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$8200(BPlusTree.java:98)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.find(BPlusTree.java:5563)
>       at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1103)
>       ... 11 more
> {code}
> To increase the chances of reproducing the issue, it may help to add
> {code:java}
> else if (relPtr == OUTDATED_REL_PTR) {
>     // Temporary delay that widens the race window; for reproduction only.
>     try {
>         Thread.sleep(1000);
>     }
>     catch (InterruptedException e) {
>         e.printStackTrace();
>     }
>
>     assert PageIdUtils.pageIndex(pageId) == 0 : fullId;
> {code}
> in 
> {{org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl}}
>  
> The root cause is that a node removed from the baseline has a gap between 
> restarting and the moment when the partition exchange future runs 
> {{initCachesOnLocalJoin}} and stops the caches for that node. The TTL cleanup 
> worker runs during that gap and keeps running even after the caches are 
> stopped, because the TTL manager ({{GridCacheSharedTtlCleanupManager}}) 
> caches a mapping between caches and managers. The solution is to unregister 
> the managers for all caches before {{onBaselineChange}} in 
> {{initCachesOnLocalJoin}}.
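
The mechanism described above can be illustrated with a small standalone model. This is only a sketch and not Ignite code: ToyTtlManager, ToyRegistry and both helper methods are hypothetical names, and the IllegalStateException merely stands in for the NPE/AssertionError seen in the stack traces. It assumes that the cleanup worker visits managers through a cached map (the traces show ConcurrentHashMap.computeIfPresent being used) and that touching a stopped cache's storage fails.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; every name here is hypothetical, not an Ignite API. */
public class TtlRegistrySketch {
    /** Stands in for a per-cache TTL manager whose backing store can be stopped. */
    static final class ToyTtlManager {
        private volatile boolean stopped;

        void stop() {
            stopped = true;
        }

        /** Fails when the backing cache has already been stopped. */
        void expire() {
            if (stopped)
                throw new IllegalStateException("cache store is already stopped");
        }
    }

    /** Stands in for the shared cleanup manager that caches cache-to-manager mappings. */
    static final class ToyRegistry {
        private final Map<String, ToyTtlManager> mgrs = new ConcurrentHashMap<>();

        void register(String cache, ToyTtlManager mgr) {
            mgrs.put(cache, mgr);
        }

        void unregister(String cache) {
            mgrs.remove(cache);
        }

        /** One pass of the cleanup worker over every manager still registered. */
        void cleanupPass() {
            for (String cache : mgrs.keySet()) {
                mgrs.computeIfPresent(cache, (name, mgr) -> {
                    mgr.expire();

                    return mgr;
                });
            }
        }
    }

    /** Cache is stopped while its mapping stays registered: the worker pass fails. */
    static boolean workerFailsWithStaleMapping() {
        ToyRegistry reg = new ToyRegistry();
        ToyTtlManager mgr = new ToyTtlManager();

        reg.register("cache1", mgr);
        mgr.stop();

        try {
            reg.cleanupPass();

            return false;
        }
        catch (IllegalStateException e) {
            return true;
        }
    }

    /** Mapping is unregistered before the cache is stopped: the worker pass skips it. */
    static boolean workerFailsWhenUnregisteredFirst() {
        ToyRegistry reg = new ToyRegistry();
        ToyTtlManager mgr = new ToyTtlManager();

        reg.register("cache1", mgr);
        reg.unregister("cache1");
        mgr.stop();

        try {
            reg.cleanupPass();

            return false;
        }
        catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale mapping, worker failed: " + workerFailsWithStaleMapping());
        System.out.println("unregistered first, worker failed: " + workerFailsWhenUnregisteredFirst());
    }
}
```

In this model the order of operations is the only difference between the two helpers, which mirrors the proposed fix: remove the mapping first, then stop the cache.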



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
