Pavel Pereslegin created IGNITE-15429:
-----------------------------------------
Summary: Taking a snapshot may increase the PME execution time by
the checkpointFrequency interval.
Key: IGNITE-15429
URL: https://issues.apache.org/jira/browse/IGNITE-15429
Project: Ignite
Issue Type: Bug
Reporter: Pavel Pereslegin
When a snapshot is taken, a checkpoint is forced on all cluster nodes.
In rare cases, the forced checkpoint request binds the start of the snapshot
operation to the planned (next scheduled) checkpoint instead of the current
one. The local partition exchange future then does not finish until the next
checkpoint starts by timeout, which can increase the exchange execution time
by as much as the checkpointFrequency interval.
Log output from a node with a checkpoint frequency of 60 seconds:
{noformat}
2021-08-31 23:30:04.792 [INFO
][exchange-worker-#179][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
Snapshot operation is scheduled on local node and will be handled by the
checkpoint listener [sctx=SnapshotFutureTask [pageStore=GridCacheSharedManagerAdapter
[starting=true, stop=false], srcNodeId=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba,
snpName=20210831233001_snapshot,
tmpSnpWorkDir=/opt/ignite/ssd/data/epk_rb_sylvanas5_ca_sbrf_ru/snp/20210831233001_snapshot,
locBuff=java.lang.ThreadLocal$SuppliedThreadLocal@6fd06d5f,
ioFactory=org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIOFactory@54067b77,
cpEndFut=java.util.concurrent.CompletableFuture@382a989[Not completed],
startedFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=378462544],
tmpConsIdDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot/db/something,
closeFut=null, err=null, started=true], topVer=AffinityTopologyVersion
[topVer=515, minorTopVer=2]]
2021-08-31 23:30:05.444 [INFO
][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=b0e9b43d-02f4-44fb-90c5-41dc6c294248,
startPtr=FileWALPointer [idx=13473, fileOff=1008077757, len=236399],
checkpointBeforeLockTime=352ms, checkpointLockWait=0ms,
checkpointListenersExecuteTime=325ms, checkpointLockHoldTime=331ms,
walCpRecordFsyncDuration=3ms, writeCheckpointEntryDuration=0ms,
splitAndSortCpPagesDuration=8ms, pages=15001,
reason='timeout']
2021-08-31 23:30:05.671 [INFO
][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, pages=15001,
markPos=FileWALPointer [idx=13473, fileOff=1008077757, len=236399],
walSegmentsCleared=0, walSegmentsCovered=[], markDuration=343ms,
pagesWrite=128ms, fsync=99ms, total=922ms]
...60 seconds later...
2021-08-31 23:31:05.779 [INFO
][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotSender]
Resolved snapshot work directory:
/opt/ignite/sas/snapshot/20210831233001_snapshot/db/something
2021-08-31 23:31:05.812 [INFO
][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
Submit partition processing tasks with partition allocated lengths: ...
2021-08-31 23:31:05.837 [INFO
][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=328ms,
checkpointLockWait=0ms, checkpointListenersExecuteTime=298ms,
checkpointLockHoldTime=304ms, reason='timeout']
{noformat}
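The delay visible in the log above can be checked with a small calculation.
The sketch below is only an illustration, not part of Ignite: the class name
SnapshotDelay and the method delayMillis are hypothetical. It takes the
timestamp of the "Snapshot operation is scheduled" entry and the timestamp of
the "Submit partition processing tasks" entry and prints the time between
them, assuming both entries come from the same node and clock.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not an Ignite class): measures the gap between two log timestamps.
public class SnapshotDelay {
    // Timestamp layout used by the log entries quoted in this issue.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Returns the number of milliseconds between two log timestamps.
    static long delayMillis(String scheduledAt, String processedAt) {
        LocalDateTime from = LocalDateTime.parse(scheduledAt, LOG_TS);
        LocalDateTime to = LocalDateTime.parse(processedAt, LOG_TS);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the log: snapshot scheduled vs. partition tasks submitted.
        long delay = delayMillis("2021-08-31 23:30:04.792", "2021-08-31 23:31:05.812");
        System.out.println("Snapshot start delay, ms: " + delay); // prints 61020
    }
}
```

The result, about 61 seconds, is close to the configured 60-second checkpoint
frequency, which matches the description: the snapshot was processed by the
next checkpoint triggered by timeout rather than by the one that started at
23:30:05.444.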