[ https://issues.apache.org/jira/browse/IGNITE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628858#comment-16628858 ]
Alexei Scherbakov commented on IGNITE-9612:
-------------------------------------------

Looks like many test failures are related to the new system thread liveness check functionality. I see many SYSTEM_WORKER_BLOCKED messages in the logs, which cause the failing tests to hang.

For example, look at [1], CpTriggeredWalDeltaConsistencyTest.testPutRemoveCacheDestroy. The test always fails both in master and in the PR branch, because fsync-ing a large number of pages on checkpoint takes longer than 10 seconds (the default failure detection timeout) on average hardware.

I suggest disabling liveness checking during TC test runs, because TC overload, JVM freezes, and the like can produce false-positive thread block detections.

[~agoncharuk], please review.

[1] https://ci.ignite.apache.org/viewLog.html?buildId=1939475&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_Pds1#

> Improve checkpoint mark phase speed.
> ------------------------------------
>
>                 Key: IGNITE-9612
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9612
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexei Scherbakov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.7
>
>
> I'm observing regular slow checkpoints due to a long mark duration, which does not correlate with the number of dirty pages:
> {noformat}
> 2018-09-01 14:55:20.408 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], checkpointLockWait=0ms, checkpointLockHoldTime=951ms, walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout']
> 2018-09-01 14:55:21.307 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002ms*, pagesWrite=478ms, fsync=421ms, total=1901ms]
> {noformat}
> {noformat}
> 2018-09-01 14:58:20.355 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, startPtr=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], checkpointLockWait=0ms, checkpointLockHoldTime=926ms, walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout']
> 2018-09-01 14:58:20.480 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, markPos=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, pagesWrite=64ms, fsync=61ms, total=1068ms]
> {noformat}
> Debugging has revealed that this is due to the large amount of work required to save metadata for metapages and free/reuse lists. Because this is done under the checkpoint write lock, all other activities are blocked, resulting in increased latency of tx and atomic operations.
> Simple solution: parallelize metadata processing during the mark phase.
> The best way to solve the problem is described in IGNITE-9520.
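Editor's note: as an illustration of the comment's suggestion to relax liveness checking in test environments, here is a minimal sketch assuming the failure-handling setters that IgniteConfiguration gained alongside this functionality in 2.7 (setFailureHandler, setSystemWorkerBlockedTimeout); the timeout value below is arbitrary, and this is not the change proposed in the ticket.

{code:java}
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class TcTestNodeStart {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Log failures (including SYSTEM_WORKER_BLOCKED) instead of stopping or
        // halting the node, so an overloaded TC agent or a JVM freeze does not
        // kill the node mid-test.
        cfg.setFailureHandler(new NoOpFailureHandler());

        // Alternatively, raise the blocked-worker timeout well above the longest
        // expected checkpoint fsync (10 minutes here, chosen arbitrarily).
        cfg.setSystemWorkerBlockedTimeout(10 * 60 * 1000L);

        Ignition.start(cfg);
    }
}
{code}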
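Editor's note: the description's "simple solution" (parallelize metadata processing during the mark phase) could look roughly like the following hypothetical sketch. MetaSource, saveMetadataParallel, and the pool wiring are illustrative names, not Ignite internals.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch: fan per-structure metadata saving out across a pool
  * instead of walking all structures sequentially on the checkpointer thread. */
final class MarkPhaseSketch {
    /** Stand-in for a data structure whose metadata must be persisted at mark time. */
    interface MetaSource {
        void saveMetadata(); // e.g. metapages, free/reuse list buckets
    }

    static void saveMetadataParallel(List<MetaSource> sources, ExecutorService pool)
        throws InterruptedException, ExecutionException {
        List<Future<?>> futs = new ArrayList<>(sources.size());

        // Still executed under the checkpoint write lock, so updaters stay
        // blocked either way; the lock hold time shrinks roughly by the pool's
        // parallelism instead of growing with the number of structures.
        for (MetaSource src : sources)
            futs.add(pool.submit(src::saveMetadata));

        // Wait for completion and propagate failures: the mark phase must not
        // finish with partially saved metadata.
        for (Future<?> f : futs)
            f.get();
    }
}
{code}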