[
https://issues.apache.org/jira/browse/IGNITE-11364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Daschinskiy updated IGNITE-11364:
--------------------------------------
Description:
Segmenting a node by a partial network drop, i.e. by applying iptables rules, can
break the ring.
Scenario:
On each machine there are two nodes, a client and a server respectively.
Let's draw a diagram (server nodes only for brevity; they were started before the
clients):
=> grid915 => ....... => grid947 => grid945 => grid703 => ..skip 12 nodes...=>
grid952 => grid946.
On the grid945 machine we drop incoming/outgoing connections with iptables.
While connections are being dropped, grid945 sends a TcpDiscoveryStatusCheckMessage
but cannot deliver it to grid703 or the 12 nodes that follow it, so a later node
accepts it with a failedNodes collection (the 13 nodes above). This message was
received by grid947, which skips those 13 nodes in
org.apache.ignite.spi.discovery.tcp.ServerImpl.RingMessageWorker#sendMessageAcrossRing.
So we see this situation in the topology:
.................. => grid947 => grid952 => ..............
........................................//
grid703 => ................ => grid662
The nodes mentioned above are not considered failed by the grid, so the topology
appears as a tail connected to the ring.
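The routing decision described above can be sketched as follows. This is NOT the
actual Ignite implementation of sendMessageAcrossRing (all names here are
hypothetical); it only illustrates how skipping the failedNodes collection, without
ever removing those nodes from topology, leaves them attached as a dangling tail:

```java
import java.util.*;

// Hypothetical sketch of the skip behavior described in this issue:
// a ring message worker bypasses nodes listed in the message's failedNodes
// collection, so the bypassed nodes never receive the message, yet they are
// never removed from topology either.
public class RingSkipSketch {
    /** Walk the ring after the sender, skipping nodes marked as failed. */
    static List<String> deliver(List<String> ring, String sender, Set<String> failedNodes) {
        List<String> visited = new ArrayList<>();
        int start = ring.indexOf(sender);

        for (int i = 1; i < ring.size(); i++) {
            String next = ring.get((start + i) % ring.size());

            if (failedNodes.contains(next))
                continue; // Skipped: never sees the message, stays in a detached tail.

            visited.add(next);
        }

        return visited;
    }

    public static void main(String[] args) {
        // Abbreviated ring from the scenario above (12 intermediate nodes elided).
        List<String> ring = Arrays.asList(
            "grid947", "grid945", "grid703", "grid662", "grid952", "grid946");

        // grid945 could not reach grid703..grid662, so the message lists them
        // (together with grid945 itself) as failed.
        Set<String> failed = new HashSet<>(Arrays.asList("grid945", "grid703", "grid662"));

        // grid947 forwards the message across the ring, bypassing the failed set:
        // only grid952 and grid946 receive it, while grid703 => ... => grid662
        // remains a tail that the grid never declares failed.
        System.out.println(deliver(ring, "grid947", failed));
    }
}
```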
> Segmenting node can cause ring topology broke
> ---------------------------------------------
>
> Key: IGNITE-11364
> URL: https://issues.apache.org/jira/browse/IGNITE-11364
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.5, 2.6, 2.7
> Reporter: Ivan Daschinskiy
> Priority: Blocker
> Fix For: 2.8
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)