[jira] [Updated] (IGNITE-10898) Exchange coordinator failover breaks in some cases when node filter is used

Alexey Goncharuk (JIRA) Fri, 11 Jan 2019 06:28:47 -0800


     [ https://issues.apache.org/jira/browse/IGNITE-10898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alexey Goncharuk updated IGNITE-10898:
--------------------------------------
    Description: 
Currently, if a node does not pass the cache node filter, we do not store the 
cache affinity on that node unless the node is the coordinator. This, however, 
may fail in the following scenario:
1) A node passing the node filter joins the cluster.
2) During the join, the coordinator fails, and a new coordinator is selected 
for which the previous exchange has already completed.
3) The new coordinator attempts to fetch the affinity, and the joining node 
resends its partitions single message. There are two problems here. First, the 
exchange fast-reply does not wait for the new affinity initialization, which 
results in an {{IllegalStateException}}. Second, such an attempt to fetch 
affinity may lead either to a deadlock or to an incorrectly fetched affinity 
(basically, the coordinator must be in consensus with the other nodes passing 
the node filter).

The attached test reproduces the issue.

I suggest always calculating and keeping affinity on all nodes, even those not 
passing the filter. In this case, there will be no need to fetch and 
recalculate affinity ({{initCoordinatorCaches}} will go away).
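A minimal sketch of the proposal, assuming that affinity is a pure function of the topology, the node filter and the cache parameters. It uses rendezvous (highest-random-weight) hashing as a simplified stand-in for Ignite's affinity function; the class `AffinityEverywhereSketch`, its `score` and `assign` methods, and the node names are hypothetical. Because every node, including one that does not pass the filter, can run the same computation locally, a newly selected coordinator would not need to fetch affinity from other nodes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: deterministic affinity that any node can compute, filtered or not. */
class AffinityEverywhereSketch {
    /** Deterministic per-(node, partition) score; the same on every node. */
    static long score(String nodeId, int part) {
        long h = nodeId.hashCode() * 31L + part;
        // MurmurHash3 fmix64 finalizer to spread the bits.
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Primary plus backups for each partition; only filter-passing nodes may own partitions. */
    static List<List<String>> assign(List<String> topology, Predicate<String> nodeFilter, int parts, int backups) {
        List<String> owners = new ArrayList<>();
        for (String n : topology)
            if (nodeFilter.test(n))
                owners.add(n);

        List<List<String>> res = new ArrayList<>(parts);
        for (int p = 0; p < parts; p++) {
            final int part = p;
            List<String> sorted = new ArrayList<>(owners);
            // Highest score first; node ID breaks ties so the order of the topology list does not matter.
            sorted.sort(Comparator.comparingLong((String n) -> score(n, part)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
            res.add(List.copyOf(sorted.subList(0, Math.min(backups + 1, sorted.size()))));
        }
        return res;
    }

    public static void main(String[] args) {
        Predicate<String> filter = n -> !n.equals("crd"); // hypothetical: "crd" fails the node filter
        // A filter-passing node and the filtered-out coordinator compute affinity independently.
        List<List<String>> onNodeA = assign(List.of("crd", "A", "B", "C"), filter, 8, 1);
        List<List<String>> onCrd = assign(List.of("C", "B", "A", "crd"), filter, 8, 1);
        System.out.println("same assignment: " + onNodeA.equals(onCrd));
        System.out.println("partition 0 -> " + onNodeA.get(0));
    }
}
```

Nodes rejected by the filter still compute the assignment but never appear as owners, and the tie-breaker on node ID keeps the result independent of the order in which each node observes the topology.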

> Exchange coordinator failover breaks in some cases when node filter is used
> ---------------------------------------------------------------------------
>
>                 Key: IGNITE-10898
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10898
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexey Goncharuk
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
