[ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628412#comment-16628412
 ] 

Maxim Muzafarov commented on IGNITE-7196:
-----------------------------------------

[~dpavlov], [~DmitriyGovorukhin],

In general, I don't think that it's possible to get clear TC Bot visa on such 
PRs. I've noticed that if TC agents are under the pressure we will get failure 
suites more often.

But, test {{IgnitePdsDestroyCacheTest.testDestroyCachesAbruptly}} *definitely 
failed due to my changes*. It's flaky - [the next PDS (Direct IO) 
1|https://ci.ignite.apache.org/viewLog.html?buildId=1942756] is OK here. 
Probable cause -- we remove cache configuration from the store before writing 
checkpoint end marker, so on binary restore, we will get an exception with 
missing cache descriptor. 
I need some time to investigate the original cause. 



> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-7196
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7196
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.3
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>             Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], 
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, 
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], 
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, 
> /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false, 
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null, 
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, 
> done=false, state=SRV, evtLatch=0, 
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, 
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164, 
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], 
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
> hash=542833742]]]
> [21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pages=157778, markPos=FileWALPointer [idx=3629, fileOffset=50829700, 
> len=9229, forceFlush=false], pagesWrite=5460ms, fsync=45781ms, total=51241ms]
> [21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resuming logging to WAL segment 
> [file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
>  offset=50838929, ver=1]
> [21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], 
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, 
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], 
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, 
> /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false, 
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null, 
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, 
> done=false, state=SRV, evtLatch=0, 
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, 
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164, 
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], 
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
> hash=542833742]]]
> [21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] 
> Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, 
> mode=REPLICATED, atomicity=TRANSACTIONAL]
> [21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] 
> Started cache [name=statementp, memoryPolicyName=Default_Region, 
> mode=PARTITIONED, atomicity=TRANSACTIONAL]
> [21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finished waiting for partition release future 
> [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], waitTime=0ms, 
> futInfo=NA]
> [21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
> [21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Applying lost cache updates since last checkpoint record 
> [lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying WAL changes [updatesApplied=0, time=607ms]
> [21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false]
> [21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Received full message, will finish exchange 
> [node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion 
> [topVer=57, minorTopVer=0]]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.24.98:58928]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.95:52532]
> [21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.174:54358]
> [21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.23.109:42528]
> [21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> err=null]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to