[https://issues.apache.org/jira/browse/IGNITE-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944589#comment-16944589]
Ignite TC Bot commented on IGNITE-12255:
----------------------------------------
{panel:title=Branch: [pull/6933/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=4658217&buildTypeId=IgniteTests24Java8_RunAll]
> Cache affinity fetching and calculation on client nodes may be broken in some
> cases
> -----------------------------------------------------------------------------------
>
> Key: IGNITE-12255
> URL: https://issues.apache.org/jira/browse/IGNITE-12255
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 2.5, 2.7
> Reporter: Pavel Kovalenko
> Assignee: Pavel Kovalenko
> Priority: Critical
> Fix For: 2.8
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We have a cluster with server and client nodes.
> We dynamically start several caches on the cluster.
> Periodically we create and destroy some temporary cache in the cluster to
> move the cluster topology version up.
> At the same time, a random client node chooses a random existing cache and
> performs operations on that cache.
> This leads to an exception on the client node saying that affinity is not
> initialized for the cache during a cache operation, e.g.:
> Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]
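> A minimal reproducer sketch of this scenario, using the public Ignite API
> (instance names, cache names, and iteration counts here are illustrative,
> not taken from the original test):
> {code:java}
> import java.util.concurrent.ThreadLocalRandom;
>
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.configuration.IgniteConfiguration;
>
> public class AffinityNotInitializedReproducer {
>     public static void main(String[] args) {
>         Ignite srv = Ignition.start(
>             new IgniteConfiguration().setIgniteInstanceName("server"));
>
>         Ignite client = Ignition.start(new IgniteConfiguration()
>             .setIgniteInstanceName("client")
>             .setClientMode(true));
>
>         // Dynamically start several caches on the cluster.
>         for (int i = 0; i < 5; i++)
>             srv.getOrCreateCache("cache-" + i);
>
>         // Create/destroy a temporary cache to move the topology version up,
>         // then touch a random cache the client has not accessed before.
>         for (int iter = 0; iter < 100; iter++) {
>             srv.getOrCreateCache("tmp-cache");
>             srv.destroyCache("tmp-cache");
>
>             IgniteCache<Integer, Integer> cache =
>                 client.cache("cache-" + ThreadLocalRandom.current().nextInt(5));
>
>             // May fail: "Affinity for topology version is not initialized".
>             cache.put(iter, iter);
>         }
>     }
> }
> {code}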
> This exception means that the last affinity for the cache was calculated on
> version [8,2], which is the cache start version. It happens because, while
> creating/destroying the temporary cache, we don't re-calculate affinity on
> client nodes for caches that exist but have not been accessed yet.
> Re-calculation in this case is cheap: we just copy the affinity assignment
> and increment the topology version.
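> To illustrate what this cheap re-calculation amounts to, here is a
> simplified stand-in for the per-cache affinity history on a client node
> (the types below are hypothetical simplifications, not the actual Ignite
> internal classes):
> {code:java}
> import java.util.List;
> import java.util.NavigableMap;
> import java.util.TreeMap;
> import java.util.UUID;
>
> class AffinityHistorySketch {
>     /** Topology version -> per-partition owner lists. */
>     private final NavigableMap<Long, List<List<UUID>>> history = new TreeMap<>();
>
>     /**
>      * Re-calculation for a topology event that doesn't move partitions:
>      * no affinity function is invoked, the latest assignment is simply
>      * re-published under the new version, so affinity lookups on
>      * newTopVer stop failing.
>      */
>     void clientEventTopologyChange(long newTopVer) {
>         history.put(newTopVer, history.lastEntry().getValue());
>     }
> }
> {code}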
> As a solution, we need to fetch affinity for all caches on client node join.
> We also need to re-calculate affinity for all affinity holders (not only
> started or only configured caches) on each client node for every topology
> event that happens in the cluster.
> This solution exposed an existing race between client node join and a
> concurrent cache destroy.
> The race is the following:
> A client node (with some configured caches) joins the cluster, sending a
> SingleMessage to the coordinator during the client PME. This SingleMessage
> contains affinity fetch requests for all cluster caches. While the
> SingleMessage is in flight, the server nodes finish the client PME and also
> process and finish a cache destroy PME. When a cache is destroyed, affinity
> for that cache is cleared. When the SingleMessage is delivered to the
> coordinator, the coordinator no longer has affinity for a requested cache
> because that cache is already destroyed. This leads to an assertion error on
> the coordinator and unpredictable behavior on the client node.
> The race can be fixed with the following change:
> If the coordinator doesn't have affinity for a cache requested by the client
> node, it doesn't break PME with an assertion error; it simply doesn't send
> affinity for that cache to the client node. When the client node receives
> the FullMessage and sees that affinity for some requested cache is missing,
> it closes the cache proxy for user interactions, which then throws a
> CacheStoppedException on every attempt to use that cache. This behavior is
> safe because the cache destroy event should arrive on the client node soon
> and remove that cache completely.
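> A sketch of the fixed coordinator-side handling (hypothetical simplified
> types; the actual change lives in the PME affinity-fetch code path):
> {code:java}
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
>
> class AffinityFetchSketch {
>     /** cacheId -> assignment; entries of destroyed caches are removed. */
>     private final Map<Integer, List<List<UUID>>> affinity = new HashMap<>();
>
>     /**
>      * Builds the affinity part of the FullMessage for a joining client.
>      * Before the fix a missing entry hit an assertion and broke PME;
>      * after the fix the entry is silently omitted and the client closes
>      * the proxy of the destroyed cache.
>      */
>     Map<Integer, List<List<UUID>>> affinityForClient(Set<Integer> requestedCacheIds) {
>         Map<Integer, List<List<UUID>>> res = new HashMap<>();
>
>         for (Integer cacheId : requestedCacheIds) {
>             List<List<UUID>> assignment = affinity.get(cacheId);
>
>             if (assignment != null)
>                 res.put(cacheId, assignment);
>             // else: cache destroyed concurrently; the client will get no
>             // affinity for it and throw CacheStoppedException on access.
>         }
>
>         return res;
>     }
> }
> {code}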
--
This message was sent by Atlassian Jira
(v8.3.4#803005)