[ 
https://issues.apache.org/jira/browse/IGNITE-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-14093:
---------------------------------
    Description: 
This issue is very rare: it is quite hard to reproduce on macOS, while some 
Windows users have reproduced it somewhat more often.

Scenario:
 # 2 baseline nodes, cache with expiry policy = 60 sec. 
 # Put some entries in the cache, stop one node immediately.
 # Remove node from baseline.
 # Wait until expiration.
 # Start the stopped node; an NPE occurs on node start.

{code:java}
[2020-05-08 16:07:17,925][ERROR][ttl-cleanup-worker-#43][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.NullPointerException]]
java.lang.NullPointerException
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2765)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
        at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
        at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
        at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
        at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
        at java.lang.Thread.run(Thread.java:748)
{code}
In some cases, the following stack trace appears instead:
{code:java}
[2020-05-25 10:49:29,677][ERROR][ttl-cleanup-worker-#242%db.IgnitePdsWithTtlDeferredDeleteOnRestartTest2%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]]]
class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6110)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1119)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1083)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1078)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2742)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
        at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
        at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
        at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
        at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: FullPageId [pageId=0001000100000007, effectivePageId=0000000100000007, grpId=-1237460590]
        at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:822)
        at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:696)
        at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:685)
        at org.apache.ignite.internal.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:156)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.acquirePage(BPlusTree.java:6041)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1420)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1397)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$8200(BPlusTree.java:98)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.find(BPlusTree.java:5563)
        at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1103)
        ... 11 more
{code}
To increase the chances of reproducing the issue, it might help to add
{code:java}
else if (relPtr == OUTDATED_REL_PTR) {
                try {
                    Thread.sleep(1000);
                }
                catch (InterruptedException e) {
                    e.printStackTrace();
                }
                assert PageIdUtils.pageIndex(pageId) == 0 : fullId;

{code}
in 
{{org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl}}

 

The root cause is that a node removed from the baseline has a gap between its 
restart and the moment when the partition exchange future calls 
{{initCachesOnLocalJoin}} and stops the caches for that node. The TTL cleanup 
worker runs during that gap and keeps running even after the caches are 
stopped, because the TTL manager ({{GridCacheSharedTtlCleanupManager}}) caches 
a mapping between caches and managers. The solution is to unregister the 
managers for all caches before {{onBaselineChange}} in 
{{initCachesOnLocalJoin}}.
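
The stale-mapping mechanism described above can be sketched in isolation. The following is a minimal, self-contained Java model and not Ignite code: {{TtlRegistrySketch}}, {{CacheStub}} and {{stopThenCleanup}} are hypothetical names, and the {{IllegalStateException}} merely stands in for the NPE and AssertionError seen in the traces. It only illustrates, under those assumptions, why a cleanup worker that walks a cached name-to-manager map keeps touching a stopped cache unless the entry is unregistered first.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a shared TTL cleanup registry; not Ignite code. */
final class TtlRegistrySketch {
    /** Simplified cache whose storage becomes unusable once it is stopped. */
    static final class CacheStub {
        private volatile boolean stopped;
        final AtomicInteger purges = new AtomicInteger();

        void stop() {
            stopped = true;
        }

        /** Stand-in for expire(): touching a stopped cache models the reported failure. */
        void expire() {
            if (stopped)
                throw new IllegalStateException("expire() called on a stopped cache");

            purges.incrementAndGet();
        }
    }

    /** Cached mapping from cache name to its manager (assumed shape). */
    private final Map<String, CacheStub> mgrs = new ConcurrentHashMap<>();

    void register(String name, CacheStub cache) {
        mgrs.put(name, cache);
    }

    void unregister(String name) {
        mgrs.remove(name);
    }

    /** One pass of the cleanup worker over every manager that is still registered. */
    void cleanupPass() {
        for (String name : mgrs.keySet())
            mgrs.computeIfPresent(name, (k, cache) -> {
                cache.expire();

                return cache;
            });
    }

    /**
     * Stops a cache and runs one more cleanup pass.
     *
     * @return true if the pass completed, false if it hit the stopped cache.
     */
    static boolean stopThenCleanup(boolean unregisterFirst) {
        TtlRegistrySketch reg = new TtlRegistrySketch();
        CacheStub cache = new CacheStub();

        reg.register("cache1", cache);
        reg.cleanupPass(); // Works while the cache is running.

        if (unregisterFirst)
            reg.unregister("cache1"); // The proposed ordering: unregister before stopping.

        cache.stop();

        try {
            reg.cleanupPass();

            return true;
        }
        catch (IllegalStateException e) {
            return false; // The stale mapping routed the worker to a stopped cache.
        }
    }
}
```

In this model {{stopThenCleanup(true)}} returns true because the worker no longer sees the stopped cache, whereas {{stopThenCleanup(false)}} returns false because the stale mapping still leads the worker to it.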



> ttl-cleanup-worker fails with AssertionError and leads to 
> CorruptedTreeException
> ---------------------------------------------------------------------------------
>
>                 Key: IGNITE-14093
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14093
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Mirza Aliev
>            Assignee: Mirza Aliev
>            Priority: Major
>         Attachments: IgnitePdsWithTtlDeferredDeleteOnRestartTest (1).java
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
