[ https://issues.apache.org/jira/browse/IGNITE-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Chugunov updated IGNITE-14093:
-------------------------------------
Reviewer: Sergey Chugunov
> ttl-cleanup-worker fails with AssertionError and leads to
> CorruptedTreeException
> ---------------------------------------------------------------------------------
>
> Key: IGNITE-14093
> URL: https://issues.apache.org/jira/browse/IGNITE-14093
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.9.1
> Reporter: Mirza Aliev
> Assignee: Mirza Aliev
> Priority: Major
> Fix For: 2.11
>
> Attachments: IgnitePdsWithTtlDeferredDeleteOnRestartTest (1).java
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This issue is very rare: it is quite hard to reproduce on macOS, though some
> Windows users reproduced it somewhat more often.
> Scenario:
> # 2 baseline nodes, cache with expiry policy = 60 sec.
> # Put some entries in the cache, stop one node immediately.
> # Remove node from baseline.
> # Wait until expiration.
> # Start the stopped node — NPE on node start.
> {code:java}
> [2020-05-08 16:07:17,925][ERROR][ttl-cleanup-worker-#43][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.NullPointerException]]
> java.lang.NullPointerException
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2765)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
>     at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
>     at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
>     at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>     at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> In some cases, it is possible to get this stack trace instead:
> {code:java}
> [2020-05-25 10:49:29,677][ERROR][ttl-cleanup-worker-#242%db.IgnitePdsWithTtlDeferredDeleteOnRestartTest2%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]]]
> class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6110)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1119)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1083)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1078)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2742)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
>     at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
>     at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
>     at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>     at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.AssertionError: FullPageId [pageId=0001000100000007, effectivePageId=0000000100000007, grpId=-1237460590]
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:822)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:696)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:685)
>     at org.apache.ignite.internal.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:156)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.acquirePage(BPlusTree.java:6041)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1420)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1397)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$8200(BPlusTree.java:98)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.find(BPlusTree.java:5563)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1103)
>     ... 11 more
> {code}
> To increase the chances of reproducing it, it might help to add
> {code:java}
> else if (relPtr == OUTDATED_REL_PTR) {
>     // Debug-only delay that widens the race window; not for production code.
>     try {
>         Thread.sleep(1000);
>     }
>     catch (InterruptedException e) {
>         e.printStackTrace();
>     }
>     assert PageIdUtils.pageIndex(pageId) == 0 : fullId;
> {code}
> in
> {{org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl}}
>
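As a reading aid for the `AssertionError` above, here is a minimal sketch that decodes the logged page ID. The bit layout (low 32 bits for the page index, next 16 bits for the partition ID) is an assumption that happens to match the logged `pageId` and `effectivePageId` values; `PageIdSketch` and its methods are hypothetical names, not the real `PageIdUtils` API.

```java
/** Hypothetical decoder for the page ID printed in the AssertionError; not Ignite code. */
final class PageIdSketch {
    /** Assumed width of the page index field (lowest bits). */
    static final int PAGE_IDX_BITS = 32;

    /** Assumed width of the partition ID field (just above the page index). */
    static final int PART_ID_BITS = 16;

    /** Extracts the low 32 bits, the value the assertion compares with 0. */
    static int pageIndex(long pageId) {
        return (int) (pageId & 0xFFFFFFFFL);
    }

    /** Extracts the 16 bits that sit directly above the page index. */
    static int partitionId(long pageId) {
        return (int) ((pageId >>> PAGE_IDX_BITS) & 0xFFFF);
    }

    /** Keeps only the partition ID and page index, dropping the upper 16 bits. */
    static long effectivePageId(long pageId) {
        return pageId & ((1L << (PAGE_IDX_BITS + PART_ID_BITS)) - 1);
    }

    public static void main(String[] args) {
        long pageId = 0x0001000100000007L; // value taken from the logged FullPageId

        // prints: pageIndex=7 partitionId=1 effectivePageId=0000000100000007
        System.out.printf("pageIndex=%d partitionId=%d effectivePageId=%016x%n",
            pageIndex(pageId), partitionId(pageId), effectivePageId(pageId));
    }
}
```

Under that assumed layout the logged page has index 7, so a check of the form `pageIndex(pageId) == 0` fails for it, which is consistent with the `AssertionError` carrying this `FullPageId`.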
> The root cause is a gap on the node that was removed from the baseline: the
> time between its restart and the moment when the partition exchange future
> runs initCachesOnLocalJoin and stops the caches for that node. The TTL cleanup
> worker ran during that gap and kept running even after the caches were
> stopped, because the TTL manager ({{GridCacheSharedTtlCleanupManager}}) caches
> a mapping between caches and managers. The solution is to unregister the
> managers for all caches before {{onBaselineChange}} in
> {{initCachesOnLocalJoin}}.
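
The ordering problem described in the root-cause paragraph can be illustrated with a small JDK-only model. This is a sketch under stated assumptions, not the actual Ignite fix: `TtlManagerSketch` and `TtlCleanupRegistrySketch` are hypothetical stand-ins, and the only detail borrowed from the issue is that the cleanup worker walks a cached cache-to-manager mapping through `ConcurrentHashMap.computeIfPresent`, as the stack traces show.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a per-cache TTL manager; not an Ignite class. */
final class TtlManagerSketch {
    private volatile boolean stopped;
    private final AtomicInteger sweeps = new AtomicInteger();

    void stop() {
        stopped = true;
    }

    /** Simulates purging expired rows; a stopped cache has nothing valid to read. */
    void expire() {
        if (stopped)
            throw new IllegalStateException("expire() called on a stopped cache");

        sweeps.incrementAndGet();
    }

    int sweeps() {
        return sweeps.get();
    }
}

/** Hypothetical model of a shared cleanup worker that caches cache-to-manager mappings. */
final class TtlCleanupRegistrySketch {
    private final ConcurrentMap<String, TtlManagerSketch> mgrs = new ConcurrentHashMap<>();

    void register(String cacheName, TtlManagerSketch mgr) {
        mgrs.put(cacheName, mgr);
    }

    void unregister(String cacheName) {
        mgrs.remove(cacheName);
    }

    /** One pass of the cleanup worker over every manager that is still registered. */
    void sweepOnce() {
        for (String name : mgrs.keySet()) {
            mgrs.computeIfPresent(name, (k, mgr) -> {
                mgr.expire();

                return mgr;
            });
        }
    }

    public static void main(String[] args) {
        TtlCleanupRegistrySketch registry = new TtlCleanupRegistrySketch();
        TtlManagerSketch mgr = new TtlManagerSketch();

        registry.register("cache1", mgr);
        registry.sweepOnce();           // cache is running: the sweep succeeds

        registry.unregister("cache1");  // unregister first ...
        mgr.stop();                     // ... then stop the cache
        registry.sweepOnce();           // the worker no longer sees the stopped cache

        System.out.println("sweeps=" + mgr.sweeps());
    }
}
```

In this model, stopping a manager while it is still registered makes the next sweep throw, which mirrors the worker touching a stopped cache; unregistering first avoids that. Because `computeIfPresent` runs its function atomically for a key, a `remove` of the same key from another thread waits for an in-flight sweep of that manager to finish, which is why unregistering before stopping is the safer order here.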