[ https://issues.apache.org/jira/browse/IGNITE-26037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirill Tkalenko updated IGNITE-26037:
-------------------------------------
Description:
While analyzing the log, I found an error when saving FreeList metadata. It
crashed the Checkpointer, which in turn left the node inoperative. This needs
to be sorted out.
Scenario: there was a cluster of three nodes on which a lot of data was
loaded; all tables were in a zone with a replica count of 1. After all the
data was loaded, the replica count was changed from 1 to 3, which led to
multiple rebalances via raft snapshots. After some time, this problem
appeared. As far as I understand, the exception itself occurred while saving
a raft snapshot.
This may be difficult to reproduce until the issue in IGNITE-26034 is fixed.
h3. {color:red}Update{color}
The root cause of the problem is a race between recreating the storage
structures at the start of the storage's rebalance and the checkpoint. There
is a small chance that the *closed* FreeList will try to trigger the metadata
sync at the checkpoint, which causes an error at the checkpoint and shuts the
node down.
In my opinion, the correct fix is to remove the checkpoint listener that
synchronizes the FreeList metadata before closing the structures, and to add
the listener back after they are recreated. There is also a small chance that
the checkpoint will start executing a callback for the closed FreeList before
the listener is removed, so we need to take that into account.
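To make the proposed fix easier to discuss, here is a minimal sketch of the idea. It is not the actual Ignite code: FreeListRebalanceSketch, Checkpointer, FreeList, PartitionStorage, startRebalance and finishRebalance are all hypothetical stand-ins, and the read/write lock is only one assumed way to handle a callback that is already running when the listener is removed.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; none of these types are the real Ignite classes. */
class FreeListRebalanceSketch {
    /** Stand-in for a checkpoint listener callback. */
    interface CheckpointListener {
        void onCheckpoint();
    }

    /** Stand-in for the checkpointer and its listener registry. */
    static class Checkpointer {
        private final List<CheckpointListener> listeners = new CopyOnWriteArrayList<>();

        void addListener(CheckpointListener listener) {
            listeners.add(listener);
        }

        void removeListener(CheckpointListener listener) {
            listeners.remove(listener);
        }

        void doCheckpoint() {
            for (CheckpointListener listener : listeners) {
                listener.onCheckpoint();
            }
        }
    }

    /** Stand-in for the free list; saving metadata after close is an error. */
    static class FreeList {
        private boolean closed;
        int saves;

        void saveMetadata() {
            if (closed) {
                throw new IllegalStateException("FreeList is closed");
            }
            saves++;
        }

        void close() {
            closed = true;
        }
    }

    /** Stand-in for the partition storage that owns the free list. */
    static class PartitionStorage {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final Checkpointer checkpointer;
        private final CheckpointListener listener = this::syncMetadataOnCheckpoint;
        private FreeList freeList = new FreeList();

        PartitionStorage(Checkpointer checkpointer) {
            this.checkpointer = checkpointer;
            checkpointer.addListener(listener);
        }

        /** Checkpoint callback: skips the sync while the structures are closed. */
        void syncMetadataOnCheckpoint() {
            lock.readLock().lock();
            try {
                if (freeList != null) {
                    freeList.saveMetadata();
                }
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Removes the listener first, then closes the structures. */
        void startRebalance() {
            checkpointer.removeListener(listener);
            // The write lock waits for a callback that is already running.
            lock.writeLock().lock();
            try {
                freeList.close();
                freeList = null;
            } finally {
                lock.writeLock().unlock();
            }
        }

        /** Recreates the structures, then returns the listener. */
        void finishRebalance() {
            lock.writeLock().lock();
            try {
                freeList = new FreeList();
            } finally {
                lock.writeLock().unlock();
            }
            checkpointer.addListener(listener);
        }

        FreeList currentFreeList() {
            lock.readLock().lock();
            try {
                return freeList;
            } finally {
                lock.readLock().unlock();
            }
        }
    }
}
```

In this sketch a callback that was picked up by the checkpoint before the listener was removed either finishes before the structures are closed (the write lock waits for it) or runs afterwards and sees no free list, so it skips the sync instead of failing the checkpoint.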
{noformat}
2025-07-24 14:11:42:486 +0000 [ERROR][%node1%checkpoint-thread][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_TERMINATION]
org.apache.ignite.internal.failure.StackTraceCapturingException: IGN-CMN-65535 Unknown error TraceId:00fda422
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263)
    at org.apache.ignite.internal.util.worker.IgniteWorker.run(IgniteWorker.java:89)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.ignite.internal.lang.IgniteInternalCheckedException: IGN-CMN-65535 java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38] TraceId:00fda422
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:347)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:243)
    ... 2 more
Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:55)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    ... 1 more
Caused by: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:819)
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:705)
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:677)
    at org.apache.ignite.internal.pagememory.util.PageHandler.writePage(PageHandler.java:202)
    at org.apache.ignite.internal.pagememory.datastructure.DataStructure.write(DataStructure.java:250)
    at org.apache.ignite.internal.pagememory.freelist.PagesList.flushBucketsCache(PagesList.java:380)
    at org.apache.ignite.internal.pagememory.freelist.PagesList.saveMetadata(PagesList.java:322)
    at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.saveMetadata(FreeListImpl.java:813)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.saveFreeListMetadataBusy(PersistentPageMemoryMvPartitionStorage.java:579)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$17(PersistentPageMemoryMvPartitionStorage.java:494)
    at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busySafe(AbstractPageMemoryMvPartitionStorage.java:1052)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$18(PersistentPageMemoryMvPartitionStorage.java:494)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:51)
    ... 3 more
{noformat}
> Error saving FreeList metadata causing checkpointer to crash
> ------------------------------------------------------------
>
> Key: IGNITE-26037
> URL: https://issues.apache.org/jira/browse/IGNITE-26037
> Project: Ignite
> Issue Type: Bug
> Reporter: Kirill Tkalenko
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: ignite-3
> Fix For: 3.1
>
> Time Spent: 1h
> Remaining Estimate: 0h