[jira] [Created] (IGNITE-15429) Taking a snapshot may increase the PME execution time by the checkpointFrequency interval.

Pavel Pereslegin (Jira) Thu, 02 Sep 2021 07:23:04 -0700

Pavel Pereslegin created IGNITE-15429:
-----------------------------------------


             Summary: Taking a snapshot may increase the PME execution time by 
the checkpointFrequency interval.
                 Key: IGNITE-15429
                 URL: https://issues.apache.org/jira/browse/IGNITE-15429
             Project: Ignite
          Issue Type: Bug
            Reporter: Pavel Pereslegin


When a snapshot is taken, a checkpoint is forced on all cluster nodes.

In rare cases, when the checkpoint is forced, the start of the snapshot 
operation is bound to the planned (next) checkpoint instead of the current 
one. The local partition exchange future then does not complete until the 
next checkpoint starts by timeout, which significantly increases the 
execution time of the exchange. 
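
The effect of binding to the wrong checkpoint can be illustrated with a toy 
model. This is only a sketch and not Ignite code: the class and method names 
are hypothetical, and it assumes the planned checkpoint fires exactly one 
checkpointFrequency after the current one.

```java
final class CheckpointBindingModel {
    /**
     * Toy model, not Ignite code. Returns how long (in ms) the snapshot start,
     * and with it the local exchange future, waits after the snapshot request.
     *
     * @param currentCpStartsInMillis   time until the current checkpoint starts
     * @param checkpointFrequencyMillis configured checkpoint frequency
     * @param boundToPlanned            true if the snapshot start was attached
     *                                  to the planned (next) checkpoint
     */
    static long snapshotStartDelayMillis(long currentCpStartsInMillis,
                                         long checkpointFrequencyMillis,
                                         boolean boundToPlanned) {
        // Assumption: the planned checkpoint fires one frequency interval
        // after the current one.
        long plannedCpStartsInMillis = currentCpStartsInMillis + checkpointFrequencyMillis;

        return boundToPlanned ? plannedCpStartsInMillis : currentCpStartsInMillis;
    }

    public static void main(String[] args) {
        long freq = 60_000L; // 60 seconds, as on the node in this report

        // Expected case: bound to the current checkpoint.
        System.out.println(snapshotStartDelayMillis(500L, freq, false));

        // Rare case described above: bound to the planned checkpoint.
        System.out.println(snapshotStartDelayMillis(500L, freq, true));
    }
}
```

In this model the rare case adds the whole checkpointFrequency interval to 
the wait, which matches the roughly 60-second gap reported here.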

Log output on a node with a checkpoint frequency of 60 seconds.
{noformat}

2021-08-31 23:30:04.792 [INFO ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask] Snapshot operation is scheduled on local node and will be handled by the checkpoint listener [sctx=SnapshotFutureTask [pageStore=GridCacheSharedManagerAdapter [starting=true, stop=false], srcNodeId=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba, snpName=20210831233001_snapshot, tmpSnpWorkDir=/opt/ignite/ssd/data/epk_rb_sylvanas5_ca_sbrf_ru/snp/20210831233001_snapshot, locBuff=java.lang.ThreadLocal$SuppliedThreadLocal@6fd06d5f, ioFactory=org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIOFactory@54067b77, cpEndFut=java.util.concurrent.CompletableFuture@382a989[Not completed], startedFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=378462544], tmpConsIdDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot/db/something, closeFut=null, err=null, started=true], topVer=AffinityTopologyVersion [topVer=515, minorTopVer=2]]
2021-08-31 23:30:05.444 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, startPtr=FileWALPointer [idx=13473, fileOff=1008077757, len=236399], checkpointBeforeLockTime=352ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=325ms, checkpointLockHoldTime=331ms, walCpRecordFsyncDuration=3ms, writeCheckpointEntryDuration=0ms, splitAndSortCpPagesDuration=8ms,  pages=15001, reason='timeout']
2021-08-31 23:30:05.671 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, pages=15001, markPos=FileWALPointer [idx=13473, fileOff=1008077757, len=236399], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=343ms, pagesWrite=128ms, fsync=99ms, total=922ms]

...60 seconds later...

2021-08-31 23:31:05.779 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotSender] Resolved snapshot work directory: /opt/ignite/sas/snapshot/20210831233001_snapshot/db/something
2021-08-31 23:31:05.812 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask] Submit partition processing tasks with partition allocated lengths: ...
2021-08-31 23:31:05.837 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=328ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=298ms, checkpointLockHoldTime=304ms, reason='timeout']
{noformat}
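
The size of the gap can be read directly from the log timestamps. The sketch 
below computes the time between the "Snapshot operation is scheduled" and the 
"Submit partition processing tasks" entries. The class and method names are 
hypothetical, and the timestamp pattern is assumed from the log lines above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class SnapshotDelay {
    // Timestamp layout assumed from the log entries quoted in this report.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Returns the time elapsed between two log timestamps. */
    static Duration between(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, LOG_TS),
            LocalDateTime.parse(to, LOG_TS));
    }

    public static void main(String[] args) {
        // "Snapshot operation is scheduled ..." entry.
        String scheduled = "2021-08-31 23:30:04.792";

        // "Submit partition processing tasks ..." entry.
        String submitted = "2021-08-31 23:31:05.812";

        System.out.println(between(scheduled, submitted).toMillis() + " ms");
    }
}
```

For the entries above this gives 61020 ms, that is, about one 
checkpointFrequency interval (60 seconds) plus roughly one second.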





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
