[ 
https://issues.apache.org/jira/browse/IGNITE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724491#comment-16724491
 ] 

Amelchev Nikita commented on IGNITE-5115:
-----------------------------------------

[~akalashnikov], I have added one more node to the test, along with checks for 
the failed event and the new discovery ring size. TC tests look good. Please 
take another look.

> Investigation of failing tests of coordinator node failure 
> -----------------------------------------------------------
>
>                 Key: IGNITE-5115
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5115
>             Project: Ignite
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Sergey Chugunov
>            Assignee: Amelchev Nikita
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.8
>
>
> Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are 
> flaky on TC and sometimes hang with the following assertion in logs:
> {code}
> Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" java.lang.AssertionError
>       at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
>       at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
>       at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
>       at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
>       at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
>       at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}
> It seems that this happens because the tests drop the connections of 
> *TcpCommunicationSpi* on the coordinator node using the *simulateNodeFailure* 
> method. At the same time, the tests leave *TcpDiscoverySpi* operational; it 
> receives a subsequent NodeFailed message and throws the assertion error shown 
> above.
> The scenario looks legitimate: CommSPI connections on the coordinator could 
> fail for some reason while DiscoSPI connections remain healthy.
> We need to investigate the situation further, determine whether the root 
> cause is the use of *simulateNodeFailure*, and propose a solution if the 
> error can happen in real life.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
