[ https://issues.apache.org/jira/browse/IGNITE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexei Scherbakov updated IGNITE-5457:
--------------------------------------
    Description:
I observe buggy behavior in case of a simulated split brain.
Nodes in DataCenter1 (where the coordinator is located) slowly leave the grid, while nodes in DataCenter2 stay in the grid forever.
In the logs I see multiple attempts by communication to kick the coordinator on socket timeout, but the number of nodes does not change.

{noformat}
19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=131ms, reason='timeout']
19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000]
19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster [rmtNode=TcpDiscoveryNode [id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1496936257121, loc=false, ver=1.10.3#20170604-sha1:30521a17, isClient=false]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027) [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        ... 6 common frames omitted
    Caused by: java.net.SocketTimeoutException: null
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884)
        ... 6 common frames omitted
19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=130ms, reason='timeout']
19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=211ms, reason='timeout']
19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] -
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200]
    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
    ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]
    ^-- PageMemory [pages=604144]
    ^-- Heap [used=33396MB, free=49.04%, comm=65536MB]
    ^-- Non heap [used=171MB, free=-1%, comm=173MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
{noformat}

  was:
I observe buggy behavior in case of simulated split brain.
Nodes in DataCenter1 (where coordinator is located) are slowly leave grid, while nodes in DataCenter2 stay in grid forever.
In logs I see attempts multiple attemps to kick coordinator by communcation by socket timeout, but number of nodes does not change.

> Weird discovery behavior on split brain.
> ----------------------------------------
>
>                 Key: IGNITE-5457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5457
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 2.0
>            Reporter: Alexei Scherbakov
>            Priority: Critical
>             Fix For: 2.2

--
This message was sent by Atlassian JIRA (v6.3.15#6346)