[
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy Pavlov updated IGNITE-7196:
-----------------------------------
Description:
Exchange can stuck and wait while new node restoring state from disk and
starting caches, there's a log snippet from a just joined new node that shows
the issue:
{noformat}
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started
exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1,
customEvt=null, allowMerge=true]
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
Resolved page store work directory:
/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resolved write ahead log work directory:
/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resolved write ahead log archive directory:
/mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Started write-ahead log manager [mode=DEFAULT]
[21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4
KiB, checkpointBuffer=100.0 MiB]
[21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6
MiB, checkpointBuffer=896.0 MiB]
[21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status
[startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
[21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checking memory state [lastValidPos=FileWALPointer [idx=3582,
fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer
[idx=3629, fileOffset=50829700, len=9229, forceFlush=false],
lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
[21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Ignite node stopped in the middle of checkpoint. Will restore memory state and
finish checkpoint on node start.
[21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.17.115:57148]
[21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af,
pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false]]
[21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying memory changes [changesApplied=165103, time=8189ms]
[21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.28.10:47964]
[21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.27.101:46000]
[21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to wait
for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
[21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still waiting
for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false,
ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
done=false, state=SRV, evtLatch=0,
remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=542833742]]]
[21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, pages=157778,
markPos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
forceFlush=false], pagesWrite=5460ms, fsync=45781ms, total=51241ms]
[21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resuming logging to WAL segment
[file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
offset=50838929, ver=1]
[21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still waiting
for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false,
ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
done=false, state=SRV, evtLatch=0,
remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=542833742]]]
[21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc,
mode=REPLICATED, atomicity=TRANSACTIONAL]
[21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
Started cache [name=statementp, memoryPolicyName=Default_Region,
mode=PARTITIONED, atomicity=TRANSACTIONAL]
[21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future [topVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0], waitTime=0ms, futInfo=NA]
[21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status
[startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
[21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Applying lost cache updates since last checkpoint record
[lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
[21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying WAL changes [updatesApplied=0, time=607ms]
[21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished
exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
crd=false]
[21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
Received full message, will finish exchange
[node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0]]
[21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.24.98:58928]
[21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.27.95:52532]
[21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.17.174:54358]
[21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.23.109:42528]
[21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
err=null]
{noformat}
was:
Exchange can stuck and wait while new node restoring state from disk and
starting caches, there's a log snippet from a just joined new node that shows
the issue:
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started
exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1,
customEvt=null, allowMerge=true]
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
Resolved page store work directory:
/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resolved write ahead log work directory:
/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resolved write ahead log archive directory:
/mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Started write-ahead log manager [mode=DEFAULT]
[21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4
KiB, checkpointBuffer=100.0 MiB]
[21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6
MiB, checkpointBuffer=896.0 MiB]
[21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status
[startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
[21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checking memory state [lastValidPos=FileWALPointer [idx=3582,
fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer
[idx=3629, fileOffset=50829700, len=9229, forceFlush=false],
lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
[21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Ignite node stopped in the middle of checkpoint. Will restore memory state and
finish checkpoint on node start.
[21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.17.115:57148]
[21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af,
pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false]]
[21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying memory changes [changesApplied=165103, time=8189ms]
[21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.28.10:47964]
[21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.27.101:46000]
[21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to wait
for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
[21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still waiting
for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false,
ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
done=false, state=SRV, evtLatch=0,
remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=542833742]]]
[21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, pages=157778,
markPos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
forceFlush=false], pagesWrite=5460ms, fsync=45781ms, total=51241ms]
[21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
Resuming logging to WAL segment
[file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
offset=50838929, ver=1]
[21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still waiting
for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false,
ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1,
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
done=false, state=SRV, evtLatch=0,
remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
hash=542833742]]]
[21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc,
mode=REPLICATED, atomicity=TRANSACTIONAL]
[21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
Started cache [name=statementp, memoryPolicyName=Default_Region,
mode=PARTITIONED, atomicity=TRANSACTIONAL]
[21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future [topVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0], waitTime=0ms, futInfo=NA]
[21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status
[startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
[21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Applying lost cache updates since last checkpoint record
[lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
[21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying WAL changes [updatesApplied=0, time=607ms]
[21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished
exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
crd=false]
[21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
Received full message, will finish exchange
[node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0]]
[21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.24.98:58928]
[21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.27.95:52532]
[21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.17.174:54358]
[21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
rmtAddr=/172.31.23.109:42528]
[21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
err=null]
> Exchange can stuck and wait while new node restoring state from disk and
> starting caches
> ----------------------------------------------------------------------------------------
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 2.3
> Reporter: Mikhail Cherkasov
> Assignee: Maxim Muzafarov
> Priority: Critical
> Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and
> starting caches, there's a log snippet from a just joined new node that shows
> the issue:
> {noformat}
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1,
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
> Resolved page store work directory:
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
> Resolved write ahead log work directory:
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
> Resolved write ahead log archive directory:
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
> Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl]
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Read checkpoint status
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Checking memory state [lastValidPos=FileWALPointer [idx=3582,
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false],
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Ignite node stopped in the middle of checkpoint. Will restore memory state
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af,
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still
> waiting for initial partition map exchange
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1,
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830,
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo,
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500,
> /127.0.0.1:48500,
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268,
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
> done=false, state=SRV, evtLatch=0,
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
> hash=542833742]]]
> [21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af,
> pages=157778, markPos=FileWALPointer [idx=3629, fileOffset=50829700,
> len=9229, forceFlush=false], pagesWrite=5460ms, fsync=45781ms, total=51241ms]
> [21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
> Resuming logging to WAL segment
> [file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
> offset=50838929, ver=1]
> [21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still
> waiting for initial partition map exchange
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1,
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f,
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101],
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500,
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830,
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false],
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo,
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500,
> /127.0.0.1:48500,
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324,
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57,
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false,
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null,
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null,
> done=false, state=SRV, evtLatch=0,
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68,
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f,
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef],
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null,
> hash=542833742]]]
> [21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
> Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc,
> mode=REPLICATED, atomicity=TRANSACTIONAL]
> [21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor]
> Started cache [name=statementp, memoryPolicyName=Default_Region,
> mode=PARTITIONED, atomicity=TRANSACTIONAL]
> [21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
> Finished waiting for partition release future
> [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], waitTime=0ms,
> futInfo=NA]
> [21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Read checkpoint status
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
> [21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Applying lost cache updates since last checkpoint record
> [lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229,
> forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
> Finished applying WAL changes [updatesApplied=0, time=607ms]
> [21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
> crd=false]
> [21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
> Received full message, will finish exchange
> [node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion
> [topVer=57, minorTopVer=0]]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.24.98:58928]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.27.95:52532]
> [21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.17.174:54358]
> [21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.31.20.209:48100,
> rmtAddr=/172.31.23.109:42528]
> [21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=57,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=57, minorTopVer=0],
> err=null]
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)