[ 
https://issues.apache.org/jira/browse/IGNITE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780376#comment-16780376
 ] 

Sergey Chugunov edited comment on IGNITE-11394 at 2/28/19 11:01 AM:
--------------------------------------------------------------------

[~agoncharuk],

As far as I understand the infinite-loop scenario, this patch fixes it.

But I don't see MetricsUpdateMessage being dropped, as the description 
suggests; it looks like we simply add the message to the tail of the queue.
Am I missing something? Do we need the code that actually drops the message?



> Infinite No next node in topology messages during node restart scenario
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11394
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11394
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexey Goncharuk
>            Assignee: Alexey Goncharuk
>            Priority: Major
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I observe a situation with the following symptoms during a cyclic restart 
> of nodes:
>  - A node that is joining the cluster sends a join request, receives 
> NodeAddedMessage and awaits NodeAddFinishedMessage
>  - The node receives a metrics update message; the message stays in the queue
>  - The whole cluster is restarted and a new ring is formed
>  - The node re-sends the join request, which is successfully processed by 
> the ring
>  - The node added message is received by the joining node
>  - The node detects that it cannot send messages (the failed nodes collection 
> contains all remote ring nodes)
>  - Since there was already a metrics update message in the queue, the node 
> attempts to re-add the message to the queue. Because the metrics update 
> message is a high-priority message, it is added to the head of the queue and 
> the node gets stuck in an infinite loop
> I suggest dropping the metrics update message in {{sendMessageAcrossRing}} if 
> we see the {{No next node in topology}} situation.
> Another question is why we don't pass the collection of failed nodes to the 
> {{ring.hasRemoteNodes()}} method.
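
To make the difference between the three options concrete (re-add to the head, re-add to the tail, drop), here is a minimal standalone sketch. It is not Ignite code: the class RequeueSketch, the Policy enum and the run method are hypothetical names used only for illustration, messages are modelled as plain strings, and the "No next node in topology" condition is assumed to hold for the whole run.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of the worker queue; not the actual discovery SPI code. */
class RequeueSketch {
    /** What to do with a metrics update that cannot be sent. */
    enum Policy { HEAD, TAIL, DROP }

    /**
     * Polls the queue for at most maxSteps iterations and returns the
     * messages in the order they were taken from the queue.
     */
    static List<String> run(Policy policy, int maxSteps) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("MetricsUpdate");   // high-priority message already in the queue
        queue.add("NodeAddFinished"); // ordinary message waiting behind it

        List<String> handled = new ArrayList<>();

        for (int step = 0; step < maxSteps && !queue.isEmpty(); step++) {
            String msg = queue.poll();
            handled.add(msg);

            // Assumption: there is no next node, so the metrics update can never be sent.
            if (msg.equals("MetricsUpdate")) {
                switch (policy) {
                    case HEAD: queue.addFirst(msg); break; // behaviour from the description
                    case TAIL: queue.addLast(msg); break;  // re-add behind waiting messages
                    case DROP: break;                      // discard the message
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values())
            System.out.println(p + " -> " + run(p, 5));
    }
}
```

With a cap of 5 steps, HEAD takes MetricsUpdate five times and never reaches NodeAddFinished, TAIL reaches NodeAddFinished on the second step but keeps cycling the metrics update afterwards, and DROP finishes after two steps with an empty queue. In this simplified model the tail variant is enough to let other messages through, while only dropping removes the repeated re-queueing.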



