Sergey Chugunov created IGNITE-5115:
---------------------------------------
Summary: Investigation of failing tests of coordinator node
failure
Key: IGNITE-5115
URL: https://issues.apache.org/jira/browse/IGNITE-5115
Project: Ignite
Issue Type: Task
Components: messaging
Reporter: Sergey Chugunov
Fix For: 2.1
Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are flaky
on TC and sometimes hang with the following assertion in logs:
{code}
Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%"
java.lang.AssertionError
at
org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{code}
It seems that this happens because tests' implementation drops connections of
*TcpCommunicatonSpi* on coordinator node with *simulateNodeFailure* method.
At the same time tests leave *TcpDiscoverySpi* operational; it receives
subsequent NodeFailed message and throws the assertion error shown above.
The whole situation looks legitimate as it is possible to imagine a situation
when CommSPI connections on coordinator fail for some reason while DiscoSPI
connections are healthy.
It is needed to investigate the situation deeper, figure out whether the root
cause is using of *simulateNodeFailure* or not and propose a solution if the
error may happen in the real life.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)