[jira] [Commented] (IGNITE-15429) Taking a snapshot may increase the PME execution time by the checkpointFrequency interval.

Pavel Pereslegin (Jira) Fri, 03 Sep 2021 06:57:06 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409520#comment-17409520
 ]


Pavel Pereslegin commented on IGNITE-15429:
-------------------------------------------

[~mmuzaf],

could you review this small patch?

> Taking a snapshot may increase the PME execution time by the 
> checkpointFrequency interval.
> ------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-15429
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15429
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Pavel Pereslegin
>            Assignee: Pavel Pereslegin
>            Priority: Major
>              Labels: iep-43
>             Fix For: 2.12
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a snapshot is taken, a checkpoint is forced on all server nodes.
> In a rare case, when forcing a checkpoint, the start of the snapshot 
> operation is set to the planned (instead of the current) checkpoint. In this 
> case, the local partition exchange future does not finish until the next 
> checkpoint starts (by timeout), which significantly increases the execution 
> time of the exchange. 
> Log output on a node with a checkpoint frequency of 60 seconds 
> (_stage="Exchange done" (61542 ms)_).
> {noformat}
> 2021-08-31 23:30:04.792 [INFO 
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
>  Snapshot operation is scheduled on local node and will be handled by the 
> checkpoint listener [sctx=SnapshotFutureTask 
> [pageStore=GridCacheSharedManagerAdapter [starting=true, stop=false], 
> srcNodeId=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba, 
> snpName=20210831233001_snapshot, 
> tmpSnpWorkDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot, 
> locBuff=java.lang.ThreadLocal$SuppliedThreadLocal@6fd06d5f, 
> ioFactory=org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIOFactory@54067b77,
>  cpEndFut=java.util.concurrent.CompletableFuture@382a989[Not completed], 
> startedFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
> hash=378462544], 
> tmpConsIdDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot/db/something,
>  closeFut=null, err=null, started=true], topVer=AffinityTopologyVersion 
> [topVer=515, minorTopVer=2]]
> 2021-08-31 23:30:05.444 [INFO 
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
>  Checkpoint started [checkpointId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, 
> startPtr=FileWALPointer [idx=13473, fileOff=1008077757, len=236399], 
> checkpointBeforeLockTime=352ms, checkpointLockWait=0ms, 
> checkpointListenersExecuteTime=325ms, checkpointLockHoldTime=331ms, 
> walCpRecordFsyncDuration=3ms, writeCheckpointEntryDuration=0ms, 
> splitAndSortCpPagesDuration=8ms,  pages=15001, reason='timeout']
> 2021-08-31 23:30:05.671 [INFO 
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
>  Checkpoint finished [cpId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, 
> pages=15001,markPos=FileWALPointer [idx=13473, fileOff=1008077757, 
> len=236399], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=343ms, 
> pagesWrite=128ms, fsync=99ms, total=922ms]
> ...60 seconds later...
> 2021-08-31 23:31:05.779 [INFO 
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotSender]
>  Resolved snapshot work directory: 
> /opt/ignite/sas/snapshot/20210831233001_snapshot/db/something
> 2021-08-31 23:31:05.812 [INFO 
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask]
>  Submit partition processing tasks with partition allocated lengths: ...
> 2021-08-31 23:31:05.837 [INFO 
> ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager]
>  Skipping checkpoint (no pages were modified) 
> [checkpointBeforeLockTime=328ms, checkpointLockWait=0ms, 
> checkpointListenersExecuteTime=298ms, checkpointLockHoldTime=304ms, 
> reason='timeout']
> ...
> 2021-08-31 23:31:06.315 [INFO 
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture]
>  Completed partition exchange 
> [localNode=1bd343d0-d6b5-4ea3-880b-ebda77bc48d7, 
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion 
> [topVer=515, minorTopVer=2], evt=DISCOVERY_CUSTOM_EVT, 
> evtNode=TcpDiscoveryNode [id=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba, 
> consistentId=epk_rb_sylvanas40.ca.sbrf.ru, addrs=ArrayList [10.126.13.32, 
> 127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500, 
> sylvanas40.ca.sbrf.ru/10.126.13.32:47500], discPort=47500, order=60, 
> intOrder=60, lastExchangeTime=1626477723350, loc=false, 
> ver=2.9.0#20210705-sha1:b32c94b2, isClient=false], rebalanced=true, 
> done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=515, 
> minorTopVer=2]]
> 2021-08-31 23:31:06.315 [INFO 
> ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture]
>  Exchange timing [startVer=AffinityTopologyVersion [topVer=515, 
> minorTopVer=2], resVer=AffinityTopologyVersion [topVer=515, minorTopVer=2], 
> stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters 
> initialization" (1 ms), stage="Determine exchange type" (24 ms), 
> stage="Preloading notification" (0 ms), stage="Wait partitions release 
> [latch=exchange]" (0 ms), stage="Wait partitions release latch 
> [latch=exchange]" (40 ms), stage="Wait partitions release [latch=exchange]" 
> (0 ms), stage="After states restored callback" (0 ms), stage="WAL history 
> reservation" (12 ms), stage="Exchange done" (61542 ms), stage="Total time" 
> (61619 ms)]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (IGNITE-15429) Taking a snapshot may increase the PME execution time by the checkpointFrequency interval.

Reply via email to