[ 
https://issues.apache.org/jira/browse/IGNITE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-5457:
--------------------------------------
    Description: 
I observe buggy behavior in the case of a simulated split brain.

Nodes in DataCenter1 (where the coordinator is located) slowly leave the grid,

while nodes in DataCenter2 stay in the grid forever.

In the logs I see multiple attempts to kick the coordinator from the 
communication layer after a socket timeout, but the number of nodes does not change.

{noformat}
19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=131ms, reason='timeout']
19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed 
out (consider increasing 'connTimeout' configuration property) 
[addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000]
19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - 
TcpCommunicationSpi failed to establish connection to node, node will be 
dropped from cluster [rmtNode=TcpDiscoveryNode 
[id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], 
sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, 
intOrder=1, lastExchangeTime=1496936257121, loc=false, 
ver=1.10.3#20170604-sha1:30521a17, isClient=false]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
still alive?). Make sure that each ComputeTask and cache Transaction has a 
timeout set in order to prevent parties from waiting forever in case of network 
issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, 
addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null]
                at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
                ... 6 common frames omitted
        Caused by: java.net.SocketTimeoutException: null
                at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
                at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884)
                ... 6 common frames omitted
19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=130ms, reason='timeout']
19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=211ms, reason='timeout']
19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200]
    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
    ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]
    ^-- PageMemory [pages=604144]
    ^-- Heap [used=33396MB, free=49.04%, comm=65536MB]
    ^-- Non heap [used=171MB, free=-1%, comm=173MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
{noformat}

  was:
I observe buggy behavior in case of simulated split brain.

Nodes in DataCenter1 (where coordinator is located) are slowly leave grid,

while nodes in DataCenter2 stay in grid forever.

In logs I see attempts multiple attemps to kick coordinator by communcation by 
socket timeout, but number of nodes does not change.

{noformat}
19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=131ms, reason='timeout']
19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed 
out (consider increasing 'connTimeout' configuration property) 
[addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000]
19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - 
TcpCommunicationSpi failed to establish connection to node, node will be 
dropped from cluster [rmtNode=TcpDiscoveryNode 
[id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], 
sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, 
intOrder=1, lastExchangeTime=1496936257121, loc=false, 
ver=1.10.3#20170604-sha1:30521a17, isClient=false]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
still alive?). Make sure that each ComputeTask and cache Transaction has a 
timeout set in order to prevent parties from waiting forever in case of network 
issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, 
addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null]
                at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
                ... 6 common frames omitted
        Caused by: java.net.SocketTimeoutException: null
                at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
                at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884)
                ... 6 common frames omitted
19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=130ms, reason='timeout']
19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=211ms, reason='timeout']
19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200]
    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
    ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]
    ^-- PageMemory [pages=604144]
    ^-- Heap [used=33396MB, free=49.04%, comm=65536MB]
    ^-- Non heap [used=171MB, free=-1%, comm=173MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
{noformat}


> Weird discovery behavior on split brain.
> ----------------------------------------
>
>                 Key: IGNITE-5457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5457
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 2.0
>            Reporter: Alexei Scherbakov
>            Priority: Critical
>             Fix For: 2.2
>
>
> I observe buggy behavior in the case of a simulated split brain.
> Nodes in DataCenter1 (where the coordinator is located) slowly leave the grid,
> while nodes in DataCenter2 stay in the grid forever.
> In the logs I see multiple attempts to kick the coordinator from the 
> communication layer after a socket timeout, but the number of nodes does not change.
> {noformat}
> 19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
> Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
> checkpointLockHoldTime=131ms, reason='timeout']
> 19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed 
> out (consider increasing 'connTimeout' configuration property) 
> [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000]
> 19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - 
> TcpCommunicationSpi failed to establish connection to node, node will be 
> dropped from cluster [rmtNode=TcpDiscoveryNode 
> [id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], 
> sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, 
> intOrder=1, lastExchangeTime=1496936257121, loc=false, 
> ver=1.10.3#20170604-sha1:30521a17, isClient=false]]
> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
> still alive?). Make sure that each ComputeTask and cache Transaction has a 
> timeout set in order to prevent parties from waiting forever in case of 
> network issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, 
> addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>       Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null]
>               at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
>               ... 6 common frames omitted
>       Caused by: java.net.SocketTimeoutException: null
>               at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
>               at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884)
>               ... 6 common frames omitted
> 19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
> Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
> checkpointLockHoldTime=130ms, reason='timeout']
> 19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
> Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
> checkpointLockHoldTime=211ms, reason='timeout']
> 19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - 
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200]
>     ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
>     ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]
>     ^-- PageMemory [pages=604144]
>     ^-- Heap [used=33396MB, free=49.04%, comm=65536MB]
>     ^-- Non heap [used=171MB, free=-1%, comm=173MB]
>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>     ^-- System thread pool [active=0, idle=0, qSize=0]
>     ^-- Outbound messages queue [size=0]
> {noformat}
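
The claim that the number of nodes does not change can be checked mechanically against the periodic metrics output. The following is only a minimal sketch, assuming the "H/N/C [hosts=..., nodes=..., CPUs=...]" line keeps the layout shown in the log above; the class name MetricsNodeCount and the method parseNodeCount are hypothetical helpers for illustration and are not part of Ignite.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetricsNodeCount {
    // Assumed layout, copied from the metrics line in the log excerpt:
    //   ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
    private static final Pattern HNC =
        Pattern.compile("H/N/C \\[hosts=(\\d+), nodes=(\\d+), CPUs=(\\d+)\\]");

    /** Returns the node count from a metrics line, or empty if the line does not match. */
    public static OptionalInt parseNodeCount(String line) {
        Matcher m = HNC.matcher(line);
        return m.find()
            ? OptionalInt.of(Integer.parseInt(m.group(2)))
            : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String metrics = "    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]";
        String other = "    ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]";

        // Print the extracted count for the metrics line and the empty result for a non-matching line.
        System.out.println(parseNodeCount(metrics));
        System.out.println(parseNodeCount(other));
    }
}
```

Feeding every metrics line of a node's log through parseNodeCount and comparing consecutive values would show whether the topology size stays at 160 after the "node will be dropped from cluster" errors.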



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)