[
https://issues.apache.org/jira/browse/IGNITE-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409520#comment-17409520
]
Pavel Pereslegin commented on IGNITE-15429:
-------------------------------------------
[~mmuzaf],
could you review this small patch?
> Taking a snapshot may increase the PME execution time by the
> checkpointFrequency interval.
> ------------------------------------------------------------------------------------------
>
> Key: IGNITE-15429
> URL: https://issues.apache.org/jira/browse/IGNITE-15429
> Project: Ignite
> Issue Type: Bug
> Reporter: Pavel Pereslegin
> Assignee: Pavel Pereslegin
> Priority: Major
> Labels: iep-43
> Fix For: 2.12
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When a snapshot is taken, a checkpoint is forced on all server nodes.
> In a rare case, when forcing a checkpoint, the start of the snapshot
> operation is set to the planned (instead of the current) checkpoint. In this
> case, the local partition exchange future does not finish until the next
> checkpoint starts (by timeout), which significantly increases the execution
> time of the exchange.
> Log output on a node with a checkpoint frequency of 60 seconds
> (_stage="Exchange done" (61542 ms)_).
> {noformat}
> 2021-08-31 23:30:04.792 [INFO
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
> Snapshot operation is scheduled on local node and will be handled by the
> checkpoint listener [sctx=SnapshotFutureTask
> [pageStore=GridCacheSharedManagerAdapter [starting=true, stop=false],
> srcNodeId=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba,
> snpName=20210831233001_snapshot,
> tmpSnpWorkDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot,
> locBuff=java.lang.ThreadLocal$SuppliedThreadLocal@6fd06d5f,
> ioFactory=org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIOFactory@54067b77,
> cpEndFut=java.util.concurrent.CompletableFuture@382a989[Not completed],
> startedFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
> hash=378462544],
> tmpConsIdDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot/db/something,
> closeFut=null, err=null, started=true], topVer=AffinityTopologyVersion
> [topVer=515, minorTopVer=2]]
> 2021-08-31 23:30:05.444 [INFO
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=b0e9b43d-02f4-44fb-90c5-41dc6c294248,
> startPtr=FileWALPointer [idx=13473, fileOff=1008077757, len=236399],
> checkpointBeforeLockTime=352ms, checkpointLockWait=0ms,
> checkpointListenersExecuteTime=325ms, checkpointLockHoldTime=331ms,
> walCpRecordFsyncDuration=3ms, writeCheckpointEntryDuration=0ms,
> splitAndSortCpPagesDuration=8ms, pages=15001, reason='timeout']
> 2021-08-31 23:30:05.671 [INFO
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=b0e9b43d-02f4-44fb-90c5-41dc6c294248,
> pages=15001,markPos=FileWALPointer [idx=13473, fileOff=1008077757,
> len=236399], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=343ms,
> pagesWrite=128ms, fsync=99ms, total=922ms]
> ...60 seconds later...
> 2021-08-31 23:31:05.779 [INFO
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotSender]
> Resolved snapshot work directory:
> /opt/ignite/sas/snapshot/20210831233001_snapshot/db/something
> 2021-08-31 23:31:05.812 [INFO
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
> Submit partition processing tasks with partition allocated lengths: ...
> 2021-08-31 23:31:05.837 [INFO
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
> Skipping checkpoint (no pages were modified)
> [checkpointBeforeLockTime=328ms, checkpointLockWait=0ms,
> checkpointListenersExecuteTime=298ms, checkpointLockHoldTime=304ms,
> reason='timeout']
> ...
> 2021-08-31 23:31:06.315 [INFO
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture]
> Completed partition exchange
> [localNode=1bd343d0-d6b5-4ea3-880b-ebda77bc48d7,
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
> [topVer=515, minorTopVer=2], evt=DISCOVERY_CUSTOM_EVT,
> evtNode=TcpDiscoveryNode [id=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba,
> consistentId=epk_rb_sylvanas40.ca.sbrf.ru, addrs=ArrayList [10.126.13.32,
> 127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500,
> sylvanas40.ca.sbrf.ru/10.126.13.32:47500], discPort=47500, order=60,
> intOrder=60, lastExchangeTime=1626477723350, loc=false,
> ver=2.9.0#20210705-sha1:b32c94b2, isClient=false], rebalanced=true,
> done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=515,
> minorTopVer=2]]
> 2021-08-31 23:31:06.315 [INFO
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture]
> Exchange timing [startVer=AffinityTopologyVersion [topVer=515,
> minorTopVer=2], resVer=AffinityTopologyVersion [topVer=515, minorTopVer=2],
> stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters
> initialization" (1 ms), stage="Determine exchange type" (24 ms),
> stage="Preloading notification" (0 ms), stage="Wait partitions release
> [latch=exchange]" (0 ms), stage="Wait partitions release latch
> [latch=exchange]" (40 ms), stage="Wait partitions release [latch=exchange]"
> (0 ms), stage="After states restored callback" (0 ms), stage="WAL history
> reservation" (12 ms), stage="Exchange done" (61542 ms), stage="Total time"
> (61619 ms)]
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)