[jira] [Commented] (IGNITE-4111) Communication fails to send message if target node did not finish join process

Ilya Lantukh (JIRA) Thu, 22 Nov 2018 03:50:35 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695805#comment-16695805
 ]


Ilya Lantukh commented on IGNITE-4111:
--------------------------------------

[~NSAmelchev], thanks for the contribution!

I've reviewed your PR, and it looks good. However, I would prefer a more 
precise test. Currently, IgniteTcpCommunicationBigClusterTest just creates 
artificial latency and starts multiple nodes, hoping to end up in 
the scenario described in the ticket. Please check whether it is possible 
to rewrite it so that it forces this scenario using synchronization primitives 
(such as CountDownLatch), making it deterministic. Also, please give the 
test a more meaningful name.
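
To illustrate the kind of synchronization meant here, a minimal sketch in plain Java follows. It does not use any Ignite API: the class name JoinRaceSketch, the method sendDuringJoin and both latches are hypothetical, and the two threads only stand in for the joining node and the sending node. The point is that the "send" step is guaranteed to happen while the "join" is still blocked, without relying on latency.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class JoinRaceSketch {
    /**
     * Hypothetical stand-in for the test scenario.
     * Returns true if the send was attempted while the join was still blocked.
     */
    static boolean sendDuringJoin() throws Exception {
        // Counted down when the "joining node" reaches the point where it waits.
        CountDownLatch joinBlocked = new CountDownLatch(1);
        // Counted down when the "sender" has attempted to deliver its message.
        CountDownLatch messageSent = new CountDownLatch(1);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> joiner = pool.submit(() -> {
                joinBlocked.countDown();
                // The join stays unfinished until the sender has acted.
                return messageSent.await(10, TimeUnit.SECONDS);
            });

            Future<Boolean> sender = pool.submit(() -> {
                // Do not send before the joining node is known to be blocked.
                if (!joinBlocked.await(10, TimeUnit.SECONDS))
                    return false;
                // A real test would send a message to the joining node here.
                messageSent.countDown();
                return true;
            });

            return joiner.get(20, TimeUnit.SECONDS) && sender.get(20, TimeUnit.SECONDS);
        }
        finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendDuringJoin());
    }
}
```

In the real test the first latch would presumably be released from a hook at the point where the joining node waits, which is an assumption about how the test could be wired rather than a description of existing code.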

> Communication fails to send message if target node did not finish join process
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-4111
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4111
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>            Reporter: Semen Boikov
>            Assignee: Amelchev Nikita
>            Priority: Minor
>             Fix For: 2.8
>
>         Attachments: test onFirstMessage hang.log
>
>
> Currently this scenario is possible:
> - the joining node has sent a join request and waits for 
> TcpDiscoveryNodeAddFinishedMessage inside ServerImpl.joinTopology
> - other nodes already see this node and can send messages to it (for example, 
> try to run a compute job on this node)
> - the joining node cannot receive the message: TcpCommunicationSpi will hang 
> inside 'onFirstMessage' on the 'getSpiContext' call, so the sending node will 
> get an error trying to establish a connection
> Possible fix: if the SPI context is not available in onFirstMessage(), then 
> TcpCommunicationSpi should send a special response indicating that this 
> node is not ready yet, and the sender should retry after some time.
> We also need to check internal code for places where a message can be 
> unnecessarily sent to a node: one such place is 
> GridCachePartitionExchangeManager.refreshPartitions, where the message is sent 
> to all known nodes, but we could filter by node order / finished exchange version.
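
The "not ready, retry later" idea from the description can be sketched in plain Java as below. This is only an illustration under stated assumptions: the Response enum, the sendWithRetry method and its parameters are hypothetical and are not part of TcpCommunicationSpi; the Supplier stands in for one connection attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class NotReadyRetrySketch {
    /** Hypothetical handshake outcomes; NOT_READY models the special response. */
    enum Response { OK, NOT_READY }

    /**
     * Repeats the handshake until the target reports OK or attempts run out.
     * Returns the number of the successful attempt, or -1 if every attempt
     * got NOT_READY.
     */
    static int sendWithRetry(Supplier<Response> handshake, int maxAttempts, long delayMs)
        throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (handshake.get() == Response.OK)
                return attempt;

            // Target has not finished joining yet: wait before the next try.
            Thread.sleep(delayMs);
        }

        return -1;
    }

    /** Simulated target that becomes ready on its third handshake. */
    static int demo() throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();

        Supplier<Response> target =
            () -> calls.incrementAndGet() < 3 ? Response.NOT_READY : Response.OK;

        return sendWithRetry(target, 5, 10);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

A bounded attempt count keeps the sender from waiting forever if the target never completes the join; the actual limits and delays would be a design decision for the fix.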



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
