Alexei Scherbakov created IGNITE-9612:
-----------------------------------------

             Summary: Improve checkpoint mark phase speed.
                 Key: IGNITE-9612
                 URL: https://issues.apache.org/jira/browse/IGNITE-9612
             Project: Ignite
          Issue Type: Improvement
            Reporter: Alexei Scherbakov
            Assignee: Alexei Scherbakov
             Fix For: 2.7


I'm observing regular slow checkpoints due to long mark duration, which is not 
related to dirty pages number:

{noformat}
2018-09-01 14:55:20.408 [INFO 
][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager]
 Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, 
startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457],
checkpointLockWait=0ms, checkpointLockHoldTime=951ms, 
walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout']
2018-09-01 14:55:21.307 [INFO 
][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager]
 Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, 
markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], 
walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002m*s, 
pagesWrite=478ms, fsync=421ms, total=1901ms] 
{noformat}

{noformat}
2018-09-01 14:58:20.355 [INFO 
][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager]
 Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, 
startPtr=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], 
checkpointLockWait=0ms, checkpointLockHoldTime=926ms, 
walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout']
2018-09-01 14:58:20.480 [INFO 
][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager]
 Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, 
markPos=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], 
walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, 
pagesWrite=64ms, fsync=61ms, total=1068ms]
{noformat}

Debugging has revealed what this is due to large amount of work required to 
save metadata for metapages and free/reuse lists. Because this is done under 
checkpoint write lock, all other activities are blocked, resulting in increased 
tx and atomic ops latency.

Simple solution: parallelize metadata processing during mark phase.

Best way to solve the problem is described in IGNITE-9520.






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to