Ivan Bessonov created IGNITE-23325:
--------------------------------------
Summary: Checkpoint read-lock timeout under high load
Key: IGNITE-23325
URL: https://issues.apache.org/jira/browse/IGNITE-23325
Project: Ignite
Issue Type: Bug
Reporter: Ivan Bessonov
We encounter the following situation while having a very intensive load. It
should not happen
{noformat}
[2024-09-25T11:03:12,180][INFO ][%node_3344%checkpoint-thread][Checkpointer]
Checkpoint started [checkpointId=db0f392d-7d71-4ef1-bb50-c17eb2d02d82,
checkpointBeforeWriteLockTime=30ms, checkpointWriteLockWait=1ms,
checkpointListenersExecuteTime=2ms, checkpointWriteLockHoldTime=4ms,
splitAndSortPagesDuration=276ms, pages=80904, reason='too many dirty pages']
227808.089 ops/s
[2024-09-25T11:03:13,438][WARN
][org.apache.ignite.internal.benchmark.UpsertKvBenchmark.upsert-jmh-worker-10][PersistentPageMemory]
Page replacements started, pages will be rotated with disk, this will affect
storage performance (consider increasing
PageMemoryDataRegionConfiguration#setMaxSize for data region) [region=default]
[2024-09-25T11:03:13,796][INFO ][checkpoint-runner-io0][CheckpointPagesWriter]
Checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried [pageCount=1]
[2024-09-25T11:03:14,071][INFO ][%node_3344%checkpoint-thread][Checkpointer]
Checkpoint finished [checkpointId=db0f392d-7d71-4ef1-bb50-c17eb2d02d82,
pages=80904, pagesWriteTime=1614ms, fsyncTime=273ms, totalTime=2203ms]
[2024-09-25T11:03:14,073][INFO ][%node_3344%compaction-thread][Compactor]
Starting new compaction round [files=64]
[2024-09-25T11:03:15,828][INFO ][%node_3344%checkpoint-thread][Checkpointer]
Checkpoint started [checkpointId=11157014-7a08-41e0-9e80-e26cf01656a8,
checkpointBeforeWriteLockTime=21ms, checkpointWriteLockWait=0ms,
checkpointListenersExecuteTime=6ms, checkpointWriteLockHoldTime=6ms,
splitAndSortPagesDuration=205ms, pages=77234, reason='too many dirty pages']
231068.630 ops/s
271814.068 ops/s
# Warmup Iteration 8: [2024-09-25T11:03:23,547][INFO
][%node_3344%compaction-thread][Compactor] Starting new compaction round
[files=64]
[2024-09-25T11:03:23,547][INFO ][%node_3344%checkpoint-thread][Checkpointer]
Checkpoint finished [checkpointId=11157014-7a08-41e0-9e80-e26cf01656a8,
pages=77234, pagesWriteTime=2547ms, fsyncTime=5099ms, totalTime=7951ms]
26376.850 ops/s
# Warmup Iteration 9: [2024-09-25T11:03:23,685][INFO
][%node_3344%checkpoint-thread][Checkpointer] Checkpoint started
[checkpointId=b1b2541a-e948-46cf-821b-412ea120146c,
checkpointBeforeWriteLockTime=9ms, checkpointWriteLockWait=0ms,
checkpointListenersExecuteTime=1ms, checkpointWriteLockHoldTime=1ms,
splitAndSortPagesDuration=125ms, pages=95251, reason='too many dirty pages']
[2024-09-25T11:03:34,706][INFO ][%node_3344%compaction-thread][Compactor]
Compaction round finished [duration=11159ms]
[2024-09-25T11:03:34,707][INFO ][%node_3344%compaction-thread][Compactor]
Starting new compaction round [files=41]
[2024-09-25T11:03:35,444][INFO ][%node_3344%lease-updater][LeaseUpdater] Leases
updated (printed once per 10 iteration(s)): [inCurrentIteration=LeaseStats
[leasesCreated=0, leasesPublished=0, leasesProlonged=64,
leasesWithoutCandidates=0], active=64, currentAssignmentsSize=64].
[2024-09-25T11:03:39,075][WARN
][%node_3344%meta-storage-safe-time-0][TrackableNetworkMessageHandler] Message
handling has been too long [duration=11ms, message=class
org.apache.ignite.raft.jraft.rpc.WriteActionRequestImpl]
[2024-09-25T11:03:40,287][INFO ][%node_3344%checkpoint-thread][Checkpointer]
Checkpoint finished [checkpointId=b1b2541a-e948-46cf-821b-412ea120146c,
pages=95251, pagesWriteTime=3098ms, fsyncTime=13146ms, totalTime=16740ms]
[2024-09-25T11:03:40,307][WARN
][%node_3344%JRaft-FSMCaller-Disruptor_stripe_24-0][FailureProcessor] Possible
failure suppressed according to a configured handler [hnd=NoOpFailureHandler
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
failureCtx=SYSTEM_CRITICAL_OPERATION_TIMEOUT]
org.apache.ignite.internal.lang.IgniteInternalException: Checkpoint read lock
acquisition has been timed out.
at
org.apache.ignite.internal.pagememory.persistence.checkpoint.CheckpointTimeoutLock.failCheckpointReadLock(CheckpointTimeoutLock.java:242)
~[ignite-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.pagememory.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:130)
~[ignite-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$runConsistently$0(PersistentPageMemoryMvPartitionStorage.java:175)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busy(AbstractPageMemoryMvPartitionStorage.java:659)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.runConsistently(PersistentPageMemoryMvPartitionStorage.java:170)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.snapshot.outgoing.SnapshotAwarePartitionDataStorage.runConsistently(SnapshotAwarePartitionDataStorage.java:76)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.PartitionListener.advanceLastAppliedIndexConsistently(PartitionListener.java:508)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.PartitionListener.handleSafeTimeSyncCommand(PartitionListener.java:504)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.PartitionListener.lambda$onWriteBusy$2(PartitionListener.java:250)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
~[?:?]
at
org.apache.ignite.internal.table.distributed.raft.PartitionListener.onWriteBusy(PartitionListener.java:199)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.PartitionListener.onWrite(PartitionListener.java:192)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine.onApply(JraftServerImpl.java:784)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:686)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:654)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:573)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:246)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:240)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.consumeBatch(StripedDisruptor.java:456)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:334)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:272)
~[ignite-raft-3.0.0-SNAPSHOT.jar:?]
at
com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:167)
~[disruptor-4.0.0.jar:?]
at
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
~[disruptor-4.0.0.jar:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2024-09-25T11:03:40,321][WARN
][org.apache.ignite.internal.benchmark.UpsertKvBenchmark.upsert-jmh-worker-6][FailureProcessor]
Possible failure suppressed according to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
failureCtx=SYSTEM_CRITICAL_OPERATION_TIMEOUT]
org.apache.ignite.internal.lang.IgniteInternalException: Checkpoint read lock
acquisition has been timed out.
at
org.apache.ignite.internal.pagememory.persistence.checkpoint.CheckpointTimeoutLock.failCheckpointReadLock(CheckpointTimeoutLock.java:242)
~[ignite-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.pagememory.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:130)
~[ignite-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$runConsistently$0(PersistentPageMemoryMvPartitionStorage.java:175)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busy(AbstractPageMemoryMvPartitionStorage.java:659)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.runConsistently(PersistentPageMemoryMvPartitionStorage.java:170)
~[ignite-storage-page-memory-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.raft.snapshot.outgoing.SnapshotAwarePartitionDataStorage.runConsistently(SnapshotAwarePartitionDataStorage.java:76)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.StorageUpdateHandler.handleUpdate(StorageUpdateHandler.java:109)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$applyUpdateCommand$125(PartitionReplicaListener.java:2766)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.tracing.Instrumentation.measure(Instrumentation.java:217)
~[ignite-core-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$applyUpdateCommand$126(PartitionReplicaListener.java:2766)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187)
~[?:?]
at
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2341)
~[?:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.applyUpdateCommand(PartitionReplicaListener.java:2754)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.applyUpdateCommand(PartitionReplicaListener.java:2805)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processSingleEntryAction$147(PartitionReplicaListener.java:3053)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187)
~[?:?]
at
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2341)
~[?:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processSingleEntryAction$149(PartitionReplicaListener.java:3052)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187)
~[?:?]
at
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2341)
~[?:?]
at
org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$processSingleEntryAction$152(PartitionReplicaListener.java:3050)
~[ignite-table-3.0.0-SNAPSHOT.jar:?]{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)