[ https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pavel Voronkin updated IGNITE-9793: ----------------------------------- Description: There is coordinator and ring of nodes 665(coordinator) -> 601 -> 724 - > 910 -> 655 coordinator deactivated: 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager] Server latch is created [latch=CompletableLatchUid {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, participantsSize=160] 2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%] nodes 601, 724, 910, 655 were deactivated: 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] Spi on coordinator received node 724 failed message: 2018-09-24 15:17:00.220 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, offheap=190000.0GB, heap=4800.0GB] 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE] 2018-09-24 15:17:00.221 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] ^-- Baseline [id=6, size=160, online=156, offline=4] topology rolled to version 187, then another node 931 failed: 2018-09-24 15:17:00.466 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, addrs=ArrayList [10.116.215.3], sockAddrs=HashSet [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, offheap=190000.0GB, heap=4800.0GB] 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE] 2018-09-24 15:17:00.467 [INFO ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] ^-- Baseline [id=6, size=160, online=155, offline=5] topology rolled to version 188. Node 724 SEGMENTED before activation starts: 2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems). 2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] 2018-09-24 15:18:07.145 [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]] Coordinator started activation on topology without node2 2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]] But node 3 which is next to node 2 haven't received activation message. Coordinator sent activation to all except 3. 2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false] GridDhtPartitionsExchangeFuture hangs forever. So one node in the ring missed the message, hovever all other nodes in topology got it, how is that possible? was: There is coordinator and ring of nodes coordinator -> 1 -> 2 - > 3 -> 4 coordinator deactivated: 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:01.620 [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager] Server latch is created [latch=CompletableLatchUid {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]} , participantsSize=160] 2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%] nodes 1, 2, 3, 4 were deactivated: 2018-09-24 15:09:01.609 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.328 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.334 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] 2018-09-24 15:09:03.332 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] Node 2 SEGMENTED 2018-09-24 15:17:50.068 [WARN ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems). 2018-09-24 15:17:50.069 [WARN ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] Coordinator started activation on topology without node2 2018-09-24 15:19:48.686 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]] But node 3 which is next to node 2 haven't received activation message. Coordinator sent activation to all except 3. 2018-09-24 15:24:25.911 [INFO ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false] GridDhtPartitionsExchangeFuture hangs forever. So one node in the ring missed the message, hovever all other nodes in topology got it, how is that possible? > Deactivation, segmentation of one node, activation may lead to hang > activation forever > -------------------------------------------------------------------------------------- > > Key: IGNITE-9793 > URL: https://issues.apache.org/jira/browse/IGNITE-9793 > Project: Ignite > Issue Type: Bug > Affects Versions: 2.5 > Reporter: Pavel Voronkin > Priority: Major > > There is coordinator and ring of nodes > 665(coordinator) -> 601 -> 724 - > 910 -> 655 > coordinator deactivated: > 2018-09-24 15:09:01.609 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Successfully deactivated data structures, services and caches > [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, > topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] > 2018-09-24 15:09:01.620 > [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager] > Server latch is created [latch=CompletableLatchUid > {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, > participantsSize=160] > 2018-09-24 15:09:01.621 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%] > nodes 601, 724, 910, 655 were deactivated: > 2018-09-24 15:09:01.609 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Successfully deactivated data structures, services and caches > [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, > topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] > 2018-09-24 15:09:03.328 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Successfully deactivated data structures, services and caches > [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, > topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] > 2018-09-24 15:09:03.334 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Successfully deactivated data structures, services and caches > [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, > topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] > 2018-09-24 15:09:03.332 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Successfully deactivated data structures, services and caches > [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, > topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]] > Spi on coordinator received node 724 failed message: > 2018-09-24 15:17:00.220 [WARN > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, > addrs=ArrayList [10.116.206.98], sockAddrs=HashSet > [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, > intOrder=110, lastExchangeTime=1537528210290, loc=false, > ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] > 2018-09-24 15:17:00.221 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, > offheap=190000.0GB, heap=4800.0GB] > 2018-09-24 15:17:00.221 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE] > 2018-09-24 15:17:00.221 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > ^-- Baseline [id=6, size=160, online=156, offline=4] > topology rolled to version 187, then another node 931 failed: > 2018-09-24 15:17:00.466 [WARN > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, > addrs=ArrayList [10.116.215.3], sockAddrs=HashSet > [grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, > intOrder=73, lastExchangeTime=1537528186599, loc=false, > ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] > 2018-09-24 15:17:00.467 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, > offheap=190000.0GB, heap=4800.0GB] > 2018-09-24 15:17:00.467 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE] > 2018-09-24 15:17:00.467 [INFO > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > ^-- Baseline [id=6, size=160, online=155, offline=5] > topology rolled to version 188. > Node 724 SEGMENTED before activation starts: > 2018-09-24 15:17:50.068 [WARN > ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi] > Node is out of topology (probably, due to short-time network problems). > 2018-09-24 15:17:50.069 [WARN > ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager] > Local node SEGMENTED: TcpDiscoveryNode > [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], > sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, > order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, > ver=2.5.1#20180906-sha1:ebde6c79, isClient=false] > 2018-09-24 15:18:07.145 > [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite] > Critical system error detected. Will be handled accordingly to configured > handler [hnd=class o.a.i.failure.StopNodeFailureHandler, > failureCtx=FailureContext [type=SEGMENTATION, err=null]] > > Coordinator started activation on topology without node2 > 2018-09-24 15:19:48.686 [INFO > ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, > client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]] > But node 3 which is next to node 2 haven't received activation message. > Coordinator sent activation to all except 3. > 2018-09-24 15:24:25.911 [INFO > ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture] > Coordinator received single message [ver=AffinityTopologyVersion > [topVer=188, minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, > allReceived=false] > GridDhtPartitionsExchangeFuture hangs forever. > So one node in the ring missed the message, hovever all other nodes in > topology got it, how is that possible? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)