[ https://issues.apache.org/jira/browse/IGNITE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amelchev Nikita updated IGNITE-5115: ------------------------------------ Issue Type: Bug (was: Task) > Investigation of failing tests of coordinator node failure > ----------------------------------------------------------- > > Key: IGNITE-5115 > URL: https://issues.apache.org/jira/browse/IGNITE-5115 > Project: Ignite > Issue Type: Bug > Components: messaging > Reporter: Sergey Chugunov > Assignee: Amelchev Nikita > Priority: Major > Fix For: 2.8 > > > Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are > flaky on TC and sometimes hang with the following assertion in logs: > {code} > Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" > java.lang.AssertionError > at > org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353) > at > org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670) > at > org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567) > at > org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366) > at > org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485) > at > org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456) > at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) > {code} > It seems that this happens because tests' implementation drops connections of > *TcpCommunicatonSpi* on coordinator node with *simulateNodeFailure* method. > At the same time tests leave *TcpDiscoverySpi* operational; it receives > subsequent NodeFailed message and throws the assertion error shown above. > The whole situation looks legitimate as it is possible to imagine a situation > when CommSPI connections on coordinator fail for some reason while DiscoSPI > connections are healthy. > It is needed to investigate the situation deeper, figure out whether the root > cause is using of *simulateNodeFailure* or not and propose a solution if the > error may happen in the real life. -- This message was sent by Atlassian JIRA (v7.6.3#76005)