[ 
https://issues.apache.org/jira/browse/IGNITE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686252#comment-16686252
 ] 

ASF GitHub Bot commented on IGNITE-5115:
----------------------------------------

GitHub user NSAmelchev opened a pull request:

    https://github.com/apache/ignite/pull/5391

    IGNITE-5115

    The problem was that the coordinator fails when it processes the node 
failed message about itself. A reproducer is attached to the PR.
    I fixed this by preventing the coordinator from removing itself from the 
ring (as is done when a node leaves). When the coordinator processes the 
message, it sends the verify message across the ring and the other nodes remove 
it from their ring maps. The new coordinator then sends the discard message, 
which ends the node failure process.
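
The flow described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `RingSketch`, `Node`, `onNodeFailed` and `failNode` are hypothetical names invented for illustration, not the code in `ServerImpl` or `TcpDiscoveryNodesRing`, and the coordinator is assumed to be the node with the lowest order in a node's ring view.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the described fix. Hypothetical names, not Ignite code. */
public class RingSketch {
    /** One node together with its local view of the ring (a set of node orders). */
    public static final class Node {
        public final long order;
        public final TreeSet<Long> ring = new TreeSet<>();

        public Node(long order, long... members) {
            this.order = order;
            for (long m : members)
                ring.add(m);
        }

        /** Assumption: the coordinator is the lowest order in the local view. */
        public boolean isCoordinator() {
            return ring.first() == order;
        }

        /** Returns true if the failed node was dropped from this node's view. */
        public boolean onNodeFailed(long failedOrder) {
            if (failedOrder == order)
                return false; // Never remove the local node; only pass the message on.

            return ring.remove(failedOrder);
        }
    }

    /** Passes a node failed message once around the ring and records what happened. */
    public static List<String> failNode(List<Node> nodes, long failedOrder) {
        List<String> events = new ArrayList<>();

        for (Node n : nodes) {
            boolean removed = n.onNodeFailed(failedOrder);
            events.add("node " + n.order + (removed ? " removed " + failedOrder : " forwarded only"));
        }

        // The surviving node that now sees itself as coordinator ends the process.
        for (Node n : nodes) {
            if (n.order != failedOrder && n.isCoordinator())
                events.add("node " + n.order + " sends discard");
        }

        return events;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node(1, 1, 2, 3), new Node(2, 1, 2, 3), new Node(3, 1, 2, 3));

        failNode(nodes, 1).forEach(System.out::println);
    }
}
```

In this model the failed coordinator keeps itself in its own view, so the self-removal that triggered the assertion never happens, while the remaining nodes still drop it and the next node in order takes over.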

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NSAmelchev/ignite ignite-5115

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/5391.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5391
    
----
commit e8b41ba1886d736ee46be0559caa230e26e55936
Author: NSAmelchev <nsamelchev@...>
Date:   2018-11-14T09:09:14Z

    Fix coordinator fails

----


> Investigation of failing tests of coordinator node failure 
> -----------------------------------------------------------
>
>                 Key: IGNITE-5115
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5115
>             Project: Ignite
>          Issue Type: Task
>          Components: messaging
>            Reporter: Sergey Chugunov
>            Assignee: Amelchev Nikita
>            Priority: Major
>
> Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are 
> flaky on TC and sometimes hang with the following assertion in logs:
> {code}
> Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" 
> java.lang.AssertionError
>       at 
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
>       at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
>       at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
>       at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
>       at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
>       at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}
> It seems that this happens because the tests' implementation drops the 
> *TcpCommunicationSpi* connections on the coordinator node with the 
> *simulateNodeFailure* method.
> At the same time the tests leave *TcpDiscoverySpi* operational; it receives 
> the subsequent NodeFailed message and throws the assertion error shown above.
> The situation looks legitimate: CommSPI connections on the coordinator can 
> fail for some reason while DiscoSPI connections stay healthy.
> We need to investigate the situation more deeply, determine whether the root 
> cause is the use of *simulateNodeFailure*, and propose a solution if the 
> error can happen in real life.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
