[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

Pavel Voronkin (JIRA) Thu, 04 Oct 2018 06:33:35 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pavel Voronkin updated IGNITE-9793:
-----------------------------------
    Description: 
There is coordinator and ring of nodes

665(coordinator) -> 601 -> 724 - > 910 -> 655

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=190000.0GB, heap=4800.0GB]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4] 

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=190000.0GB, heap=4800.0GB]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=155, offline=5]

topology rolled to version 188.

Node 724 SEGMENTED before activation starts:

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

2018-09-24 15:18:07.145 
[ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=class o.a.i.failure.StopNodeFailureHandler, 
failureCtx=FailureContext [type=SEGMENTATION, err=null]]

 

Coordinator started activation on topology without node2

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

But node 3 which is next to node 2 haven't received activation message.

Coordinator sent activation to all except 3.

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false]

GridDhtPartitionsExchangeFuture hangs forever.

 So one node in the ring missed the message, hovever all other nodes in 
topology got it, how is that possible?

 

  was:
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Node 2 SEGMENTED

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

Coordinator started activation on topology without node2

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

But node 3 which is next to node 2 haven't received activation message.

Coordinator sent activation to all except 3.

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false]

GridDhtPartitionsExchangeFuture hangs forever.

 So one node in the ring missed the message, hovever all other nodes in 
topology got it, how is that possible?

 


> Deactivation, segmentation of one node, activation may lead to hang 
> activation forever
> --------------------------------------------------------------------------------------
>
>                 Key: IGNITE-9793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9793
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: Pavel Voronkin
>            Priority: Major
>
> There is coordinator and ring of nodes
> 665(coordinator) -> 601 -> 724 - > 910 -> 655
> coordinator deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 
> [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
>  Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
> participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> nodes 601, 724, 910, 655 were deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> Spi on coordinator received node 724 failed message:
> 2018-09-24 15:17:00.220 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
> addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
> [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, 
> intOrder=110, lastExchangeTime=1537528210290, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
> 2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
> offheap=190000.0GB, heap=4800.0GB]
> 2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
> 2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=156, offline=4] 
> topology rolled to version 187, then another node 931 failed:
> 2018-09-24 15:17:00.466 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
> addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
> [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, 
> intOrder=73, lastExchangeTime=1537528186599, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
> 2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
> offheap=190000.0GB, heap=4800.0GB]
> 2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
> 2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=155, offline=5]
> topology rolled to version 188.
> Node 724 SEGMENTED before activation starts:
> 2018-09-24 15:17:50.068 [WARN 
> ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
>  Node is out of topology (probably, due to short-time network problems).
>  2018-09-24 15:17:50.069 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Local node SEGMENTED: TcpDiscoveryNode 
> [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
> sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
> order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
> 2018-09-24 15:18:07.145 
> [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
>  Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeFailureHandler, 
> failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>  
> Coordinator started activation on topology without node2
> 2018-09-24 15:19:48.686 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
> client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]
> But node 3 which is next to node 2 haven't received activation message.
> Coordinator sent activation to all except 3.
> 2018-09-24 15:24:25.911 [INFO 
> ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Coordinator received single message [ver=AffinityTopologyVersion 
> [topVer=188, minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, 
> allReceived=false]
> GridDhtPartitionsExchangeFuture hangs forever.
>  So one node in the ring missed the message, hovever all other nodes in 
> topology got it, how is that possible?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

Reply via email to