[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-10-10 Thread Maxim Muzafarov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Muzafarov updated IGNITE-7196:

Fix Version/s: (was: 2.7)
   2.8

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
> Fix For: 2.8
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> {noformat}
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-10-07 Thread Dmitriy Pavlov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Pavlov updated IGNITE-7196:
---
Description: 
Exchange can stuck and wait while new node restoring state from disk and 
starting caches, there's a log snippet from a just joined new node that shows 
the issue:

{noformat}
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
customEvt=null, allowMerge=true]
[21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] 
Resolved page store work directory: 
/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
 Resolved write ahead log work directory: 
/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
 Resolved write ahead log archive directory: 
/mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
[21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
 Started write-ahead log manager [mode=DEFAULT]
[21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
KiB, checkpointBuffer=100.0 MiB]
[21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
MiB, checkpointBuffer=896.0 MiB]
[21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
 Read checkpoint status 
[startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
 
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
[21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
 Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
[idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
[21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
 Ignite node stopped in the middle of checkpoint. Will restore memory state and 
finish checkpoint on node start.
[21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
rmtAddr=/172.31.17.115:57148]
[21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
 Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false]]
[21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
 Finished applying memory changes [changesApplied=165103, time=8189ms]
[21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
rmtAddr=/172.31.28.10:47964]
[21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
rmtAddr=/172.31.27.101:46000]
[21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to wait 
for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
[21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still waiting 
for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture 
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
[id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 
172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, 
loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], 
sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], 
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false, 
ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], 
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-09-27 Thread Dmitriy Pavlov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Pavlov updated IGNITE-7196:
---
Fix Version/s: (was: 2.8)
   2.7

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
> Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-09-27 Thread Mikhail Cherkasov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Cherkasov updated IGNITE-7196:
--
Fix Version/s: (was: 2.7)
   2.8

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
> Fix For: 2.8
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-06-26 Thread Dmitriy Pavlov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Pavlov updated IGNITE-7196:
---
Fix Version/s: (was: 2.6)
   2.7

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Alexey Goncharuk
>Priority: Critical
> Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-04-09 Thread Alexey Goncharuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Goncharuk updated IGNITE-7196:
-
Fix Version/s: (was: 2.5)
   2.6

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Alexey Goncharuk
>Priority: Critical
> Fix For: 2.6
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> 

[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches

2018-01-17 Thread Vladimir Ozerov (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Ozerov updated IGNITE-7196:

Fix Version/s: (was: 2.4)
   2.5

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> 
>
> Key: IGNITE-7196
> URL: https://issues.apache.org/jira/browse/IGNITE-7196
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Alexey Goncharuk
>Priority: Critical
> Fix For: 2.5
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
>