[ 
https://issues.apache.org/jira/browse/IGNITE-26037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-26037:
-------------------------------------
    Description: 
While analyzing the log, I found an error while saving FreeList metadata that 
crashed the Checkpointer and, as a consequence, left the node inoperative. 
This needs to be investigated.

Scenario: a three-node cluster was loaded with a large amount of data, with 
all tables in a zone with a replica count of 1. After loading finished, the 
replica count was changed from 1 to 3, which triggered multiple rebalances via 
raft snapshots. After some time, the problem appeared. As far as I understand, 
the exception itself occurred while saving a raft snapshot.

This may be difficult to reproduce until the issue in IGNITE-26034 is fixed.

h3. {color:red}Update{color}
The root cause is a race between the recreation of a storage's structures at 
the start of its rebalance and a checkpoint. There is a small chance that the 
*closed* FreeList tries to sync its metadata at the checkpoint, which fails 
the checkpoint and shuts the node down.

In my opinion, the correct fix is to remove the checkpoint listener that 
synchronizes the FreeList metadata before closing the structures, and to add 
the listener back after they are recreated. There is also a small chance that 
the checkpoint starts executing the callback for the closed FreeList before 
the listener is removed, so we need to take that into account.
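
A minimal sketch of the proposed fix, under stated assumptions. All names here 
(CheckpointListener, CheckpointListeners, FreeListStub, PartitionStorageSketch, 
startRebalance, finishRecreate) are hypothetical stand-ins and not the actual 
Ignite classes. The listener is removed before the structures are closed and 
added back after they are recreated; a read/write lock plus an "open" flag 
covers the case where the checkpoint has already picked up the callback before 
the listener was removed.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the checkpoint callback that syncs FreeList metadata. */
interface CheckpointListener {
    void onCheckpoint();
}

/** Hypothetical stand-in for the checkpointer's listener registry. */
class CheckpointListeners {
    private final List<CheckpointListener> listeners = new CopyOnWriteArrayList<>();

    void add(CheckpointListener listener) { listeners.add(listener); }

    void remove(CheckpointListener listener) { listeners.remove(listener); }

    /** Iterates over a snapshot, so a listener removed concurrently may still be called once. */
    void fireCheckpoint() {
        for (CheckpointListener listener : listeners) {
            listener.onCheckpoint();
        }
    }
}

/** Hypothetical stand-in for a FreeList: saving metadata after close() fails. */
class FreeListStub {
    private volatile boolean closed;
    int savedCount;

    void saveMetadata() {
        if (closed) {
            throw new IllegalStateException("FreeList is closed");
        }
        savedCount++;
    }

    void close() { closed = true; }
}

/** Hypothetical partition storage that owns the FreeList and its checkpoint listener. */
class PartitionStorageSketch {
    private final CheckpointListeners listeners;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CheckpointListener syncListener = this::syncMetadataOnCheckpoint;
    private FreeListStub freeList = new FreeListStub(); // guarded by lock
    private boolean structuresOpen = true;              // guarded by lock

    PartitionStorageSketch(CheckpointListeners listeners) {
        this.listeners = listeners;
        listeners.add(syncListener);
    }

    /** Checkpoint callback: skips the sync if the structures are currently closed. */
    void syncMetadataOnCheckpoint() {
        lock.readLock().lock();
        try {
            if (structuresOpen) {
                freeList.saveMetadata();
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Start of rebalance: remove the listener first, then close the structures. */
    void startRebalance() {
        listeners.remove(syncListener);
        lock.writeLock().lock(); // waits for a callback that is already running
        try {
            structuresOpen = false;
            freeList.close();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** After the structures are recreated: reopen them, then add the listener back. */
    void finishRecreate() {
        lock.writeLock().lock();
        try {
            freeList = new FreeListStub();
            structuresOpen = true;
        } finally {
            lock.writeLock().unlock();
        }
        listeners.add(syncListener);
    }

    FreeListStub freeList() {
        lock.readLock().lock();
        try {
            return freeList;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Removing the listener alone would not be enough in this sketch, because the 
registry iterates over a snapshot and a late callback can still arrive. The 
flag check under the read lock turns that late callback into a no-op instead 
of an error.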

{noformat}
2025-07-24 14:11:42:486 +0000 [ERROR][%node1%checkpoint-thread][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_TERMINATION]
org.apache.ignite.internal.failure.StackTraceCapturingException: IGN-CMN-65535 Unknown error TraceId:00fda422
        at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
        at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263)
        at org.apache.ignite.internal.util.worker.IgniteWorker.run(IgniteWorker.java:89)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.ignite.internal.lang.IgniteInternalCheckedException: IGN-CMN-65535 java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38] TraceId:00fda422
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:347)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:243)
        ... 2 more
Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:55)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        ... 1 more
Caused by: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:819)
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:705)
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:677)
        at org.apache.ignite.internal.pagememory.util.PageHandler.writePage(PageHandler.java:202)
        at org.apache.ignite.internal.pagememory.datastructure.DataStructure.write(DataStructure.java:250)
        at org.apache.ignite.internal.pagememory.freelist.PagesList.flushBucketsCache(PagesList.java:380)
        at org.apache.ignite.internal.pagememory.freelist.PagesList.saveMetadata(PagesList.java:322)
        at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.saveMetadata(FreeListImpl.java:813)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.saveFreeListMetadataBusy(PersistentPageMemoryMvPartitionStorage.java:579)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$17(PersistentPageMemoryMvPartitionStorage.java:494)
        at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busySafe(AbstractPageMemoryMvPartitionStorage.java:1052)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$18(PersistentPageMemoryMvPartitionStorage.java:494)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:51)
        ... 3 more
{noformat}


> Error saving FreeList metadata causing checkpointer to crash
> ------------------------------------------------------------
>
>                 Key: IGNITE-26037
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26037
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Tkalenko
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h


