[ 
https://issues.apache.org/jira/browse/IGNITE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladislav Pyatkov updated IGNITE-4491:
--------------------------------------
    Description: 
Reproduction steps:
1) Start nodes:

{noformat}
DC1                   DC2

1 (10.116.172.1)      8 (10.116.64.11)
2 (10.116.172.2)      7 (10.116.64.12)
3 (10.116.172.3)      6 (10.116.64.13)
4 (10.116.172.4)      5 (10.116.64.14)
{noformat}

Each server node has a client running on the same host (see the source code in the attachment).

2) Drop the connection between two pairs of nodes.

Between nodes 1 and 8:
{noformat}
1 (10.116.172.1)      8 (10.116.64.11)
{noformat}

Drop all inbound and outbound traffic by running the following on 10.116.172.1:
{code}
iptables -A INPUT -s 10.116.64.11 -j DROP
iptables -A OUTPUT -d 10.116.64.11 -j DROP
{code}
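To restore connectivity between runs, the same rules can be removed with `iptables -D`, which deletes the rule matching the given specification. The sketch below is not part of the original report: `partition_rules` is a hypothetical helper that only prints the commands (a dry run), so they can be reviewed before being piped to `sh` as root.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): print the iptables
# commands that cut (-A) or restore (-D) all traffic between the local host
# and one peer. Nothing is executed; pipe the output to "sh" as root to apply.
partition_rules() {
    action="$1"   # -A appends the DROP rules, -D deletes them again
    peer="$2"     # IP address of the remote node
    echo "iptables $action INPUT -s $peer -j DROP"
    echo "iptables $action OUTPUT -d $peer -j DROP"
}

# Cut 1-8 on 10.116.172.1, then undo it.
partition_rules -A 10.116.64.11
partition_rules -D 10.116.64.11
```

Generating both the append and the delete commands from one function keeps the rule specifications identical, which `iptables -D` needs in order to find the rule.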

Between nodes 4 and 5:

{noformat}
4 (10.116.172.4)      5 (10.116.64.14)
{noformat}

Drop all inbound and outbound traffic by running the following on 10.116.172.4:
{code}
iptables -A INPUT -s 10.116.64.14 -j DROP
iptables -A OUTPUT -d 10.116.64.14 -j DROP
{code}
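Both cuts can be described by a single list of host/peer pairs. This is a sketch under stated assumptions: the pairs are the two from step 2, the rules go only on the DC1 side as in the report, and the commands are printed with the target host as a prefix instead of being sent by any particular remote-execution tool. `PAIRS` and `print_plan` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical pair list: "dc1_host dc2_peer", one cut per line (from step 2).
PAIRS="10.116.172.1 10.116.64.11
10.116.172.4 10.116.64.14"

# Print, for each DC1 host, the two DROP rules that isolate it from its peer.
print_plan() {
    echo "$PAIRS" | while read -r host peer; do
        echo "$host: iptables -A INPUT -s $peer -j DROP"
        echo "$host: iptables -A OUTPUT -d $peer -j DROP"
    done
}

print_plan
```

The DC2 hosts need no rules: dropping INPUT from the peer and OUTPUT to the peer on the DC1 host already blocks traffic in both directions between the two addresses.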

3) After several seconds, the whole grid stops.

After the traffic is dropped, the logs show which nodes were segmented (note that the clients were not segmented):
{noformat}
[12:04:33,914][INFO][disco-event-worker-#211%null%][GridDiscoveryManager] Topology snapshot [ver=18, servers=6, clients=8, CPUs=456, heap=68.0GB]
{noformat}

At the same moment, all operations in the cluster stopped.
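The `servers` and `clients` counters in the topology snapshot are what reveal the segmentation: with one client per server host, the grid starts with 8 servers and 8 clients, and after the cut it reports 6 servers but still 8 clients. The sketch below extracts those counters with standard `sed`, assuming each snapshot is logged on a single line; `snapshot_counts` is a hypothetical name, not part of Ignite.

```shell
#!/bin/sh
# Read log lines on stdin and print "ver servers clients" for every
# "Topology snapshot" line; other lines are ignored.
snapshot_counts() {
    sed -n 's/.*Topology snapshot \[ver=\([0-9]*\), servers=\([0-9]*\), clients=\([0-9]*\),.*/\1 \2 \3/p'
}

# The log line from this report, joined onto one line.
echo '[12:04:33,914][INFO][disco-event-worker-#211%null%][GridDiscoveryManager] Topology snapshot [ver=18, servers=6, clients=8, CPUs=456, heap=68.0GB]' | snapshot_counts
```

Run over a whole node log (for example `snapshot_counts < node.log`, file name hypothetical), a drop in the server count while the client count stays the same matches the situation described above.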

> Commutation loss between two nodes leads to hang whole cluster.
> ---------------------------------------------------------------
>
>                 Key: IGNITE-4491
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4491
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.8
>            Reporter: Vladislav Pyatkov
>            Priority: Critical
>         Attachments: Segmentation.7z
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
