[ 
https://issues.apache.org/jira/browse/IGNITE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladislav Pyatkov updated IGNITE-4491:
--------------------------------------
    Description: 
Reproduction steps:
1) Start nodes:

{noformat}
DC1                       DC2

1 (10.116.172.1)      8 (10.116.64.11)
2 (10.116.172.2)      7 (10.116.64.12)
3 (10.116.172.3)      6 (10.116.64.13)
4 (10.116.172.4)      5 (10.116.64.14)
{noformat}

Only one node (10.116.172.3) runs a client (see the sources in the attachment).
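
For reference, a minimal sketch of how such a node could be started (assuming plain TCP discovery with a static IP finder; the actual configuration is in the attached sources and may differ):

{code}
import java.util.Arrays;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class StartNode {
    public static void main(String[] args) {
        // Static list of all eight server hosts from both data centers.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList(
            "10.116.172.1", "10.116.172.2", "10.116.172.3", "10.116.172.4",
            "10.116.64.11", "10.116.64.12", "10.116.64.13", "10.116.64.14"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);
        // The single client on 10.116.172.3 would additionally set:
        // cfg.setClientMode(true);

        Ignition.start(cfg);
    }
}
{code}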

2) Drop connections

Between nodes 2 and 8:
{noformat}
2 (10.116.172.2)      8 (10.116.64.11)
{noformat}

Drop all inbound and outbound traffic.
Invoke from 10.116.172.2:
{code}
iptables -A INPUT -s 10.116.64.11 -j DROP
iptables -A OUTPUT -d 10.116.64.11 -j DROP
{code}

Between nodes 4 and 6:

{noformat}
4 (10.116.172.4)      6 (10.116.64.13)
{noformat}

Invoke from 10.116.172.4:
{code}
iptables -A INPUT -s 10.116.64.13 -j DROP
iptables -A OUTPUT -d 10.116.64.13 -j DROP
{code}

3) All client transactions stopped due to locks held on several keys.
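
For illustration, a pessimistic transaction of the following shape is enough to block once its keys span the dropped links. This is a sketch only: the cache name ("tx-cache"), the keys, and the timeout are illustrative assumptions; the real workload is in the attached sources.

{code}
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class ClientTx {
    public static void main(String[] args) {
        // Assumes the client node from the startup sketch runs in this JVM.
        Ignite ignite = Ignition.ignite();
        IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache("tx-cache");

        // Explicit 10 s timeout; per the log below it did not fire during the hang.
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ,
                TimeUnit.SECONDS.toMillis(10), 2)) {
            cache.put(1, 1); // lock is acquired here in PESSIMISTIC mode
            cache.put(2, 2);
            tx.commit();     // hangs in the prepare phase if a backup is unreachable
        }
    }
}
{code}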

In the client log I saw a message about a long-running operation (for some
reason the timeout did not fire there):
{noformat}
2017-01-24 13:51:52 WARN  GridCachePartitionExchangeManager:480 - Found long 
running cache future [startTime=13:50:26.076, curTime=13:51:52.589, 
fut=GridNearTxFinishFuture 
[futId=05d4910d951-40022b05-4d4c-4d5d-a05e-e20bcd4ba0bf, tx=GridNearTxLocal...
{noformat}

In the server log I saw an exception during the "prepare" phase:
{noformat}
2017-01-24 13:50:36 ERROR IgniteTxHandler:495 - Failed to prepare DHT 
transaction: GridDhtTxLocal...
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareFuture$PrepareTimeoutObject.onTimeout(GridDhtTxPrepareFuture.java:1806)
        at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:159)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

  was:
Reproduction steps:
1) Start nodes:

{noformat}
DC1                       DC2

1 (10.116.172.1)      8 (10.116.64.11)
2 (10.116.172.2)      7 (10.116.64.12)
3 (10.116.172.3)      6 (10.116.64.13)
4 (10.116.172.4)      5 (10.116.64.14)
{noformat}

Each node has a client which runs on the same host as the server (see the
sources in the attachment).

2) Drop connections

Between nodes 1 and 8:
{noformat}
1 (10.116.172.1)      8 (10.116.64.11)
{noformat}

Drop all inbound and outbound traffic.
Invoke from 10.116.172.1:
{code}
iptables -A INPUT -s 10.116.64.11 -j DROP
iptables -A OUTPUT -d 10.116.64.11 -j DROP
{code}

Between nodes 4 and 5:

{noformat}
4 (10.116.172.4)      5 (10.116.64.14)
{noformat}

Invoke from 10.116.172.4:
{code}
iptables -A INPUT -s 10.116.64.14 -j DROP
iptables -A OUTPUT -d 10.116.64.14 -j DROP
{code}

3) The whole grid stopped after several seconds.

If you look into the logs, you can see that nodes 1 (10.116.172.1) and
5 (10.116.64.14) were segmented and stopped according to the segmentation
policy after the traffic was dropped (note that the clients were not
segmented; a configuration sketch follows the log):
{noformat}
[12:04:33,914][INFO][disco-event-worker-#211%null%][GridDiscoveryManager] 
Topology snapshot [ver=18, servers=6, clients=8, CPUs=456, heap=68.0GB]
{noformat}

All operations stopped at the same time.
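
The stop-on-segmentation behavior is governed by the node's segmentation policy. A minimal configuration sketch follows (STOP is the default; the attached configuration may set it differently):

{code}
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;

public class SegmentedNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Stop the local node when it detects it has been segmented
        // from the rest of the topology.
        cfg.setSegmentationPolicy(SegmentationPolicy.STOP);
        Ignition.start(cfg);
    }
}
{code}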


> Communication loss between two nodes leads to whole-cluster hang
> ----------------------------------------------------------------
>
>                 Key: IGNITE-4491
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4491
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.8
>            Reporter: Vladislav Pyatkov
>            Priority: Critical
>         Attachments: Segmentation.7z
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
