Pavel Kovalenko created IGNITE-12255:
----------------------------------------
Summary: Cache affinity fetching and calculation on client nodes
may be broken in some cases
Key: IGNITE-12255
URL: https://issues.apache.org/jira/browse/IGNITE-12255
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8
We have a cluster with server and client nodes.
We dynamically start several caches in the cluster.
Periodically we create and destroy a temporary cache to advance the cluster
topology version.
At the same time, a random client node chooses a random existing cache and
performs operations on it.
This leads to an exception on the client node saying that affinity is not
initialized for the cache during a cache operation, e.g.:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]
This exception means that the last affinity for the cache was calculated on
version [8:2], which is the version at which the cache was started. This
happens because, when a temporary cache is created or destroyed, we do not
re-calculate affinity on client nodes for caches that exist but have not yet
been accessed. Re-calculation in this case is cheap: we just copy the affinity
assignment and increment the topology version.
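The cheap re-calculation can be sketched as follows. This is a minimal
illustration using a hypothetical AffinityHistory holder; the class and method
names are not Ignite internals. When a topology event moves no partitions, the
previous assignment is reused as-is and only the recorded topology version
advances:

```java
import java.util.List;

// Hypothetical sketch (not Ignite internals): a per-cache affinity holder
// that reuses the last assignment on topology events that move no partitions.
class AffinityHistory {
    // Last known assignment: partition index -> ordered list of node ids.
    private final List<List<String>> lastAssignment;

    private long major; // major topology version, e.g. 8 in "8:2"
    private long minor; // minor topology version, e.g. 2 in "8:2"

    AffinityHistory(List<List<String>> assignment, long major, long minor) {
        this.lastAssignment = assignment;
        this.major = major;
        this.minor = minor;
    }

    // Cheap re-calculation: partitions did not move, so keep the same
    // assignment and only advance the recorded topology version.
    void onTopologyEvent(long newMajor, long newMinor) {
        this.major = newMajor;
        this.minor = newMinor;
        // lastAssignment is reused as-is: no distribution recomputation.
    }

    String version() { return major + ":" + minor; }

    List<List<String>> assignment() { return lastAssignment; }
}
```

If this re-calculation were applied on client nodes for every topology event,
the holder above would already be at version 8:10 when the cache operation
arrives, instead of being stuck at the cache start version 8:2.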
As a solution, we need to fetch affinity for all caches when a client node
joins. We also need to re-calculate affinity for all affinity holders (not
only for started caches or only for configured caches) on client nodes for
every topology event that happens in the cluster.
Implementing this solution revealed an existing race between client node join
and a concurrent cache destroy.
The race is the following:
A client node (with some configured caches) joins the cluster and sends a
SingleMessage to the coordinator during the client PME. This SingleMessage
contains affinity fetch requests for all cluster caches. While the
SingleMessage is in flight, the server nodes finish the client PME and also
process and finish a cache-destroy PME. When a cache is destroyed, its
affinity is cleared. By the time the SingleMessage is delivered to the
coordinator, the coordinator no longer has affinity for the requested cache
because the cache has already been destroyed. This leads to an assertion error
on the coordinator and to unpredictable behavior on the client node.
The race may be fixed with the following change: if the coordinator does not
have affinity for a cache requested by the client node, it does not break the
PME with an assertion error; it simply does not send affinity for that cache
to the client node. When the client node receives the FullMessage and sees
that affinity for some requested cache is missing, it closes the cache proxy
for user interactions, so that every attempt to use that cache throws a
CacheStopped exception. This behavior is safe because the cache destroy event
should arrive on the client node soon and destroy that cache completely.
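The proposed change can be sketched roughly as below. The
CoordinatorAffinityResponder and ClientCacheProxy classes are hypothetical and
only illustrate the decision logic; real PME message handling in Ignite is far
more involved:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed fix (illustrative names, not Ignite
// internals). Coordinator side: answer affinity fetch requests, silently
// omitting caches whose affinity was cleared by a concurrent destroy instead
// of failing with an assertion error.
class CoordinatorAffinityResponder {
    // cacheId -> affinity assignment known to the coordinator.
    private final Map<Integer, String> affinity = new HashMap<>();

    void onCacheStart(int cacheId, String assignment) {
        affinity.put(cacheId, assignment);
    }

    void onCacheDestroy(int cacheId) {
        // Affinity is cleared when the cache is destroyed.
        affinity.remove(cacheId);
    }

    // Build the affinity part of the FullMessage. Instead of asserting that
    // every requested cache is known, skip caches that are already gone.
    Map<Integer, String> answerAffinityFetch(List<Integer> requestedCacheIds) {
        Map<Integer, String> reply = new HashMap<>();

        for (int id : requestedCacheIds) {
            String aff = affinity.get(id);

            if (aff != null)
                reply.put(id, aff); // destroyed caches are simply omitted
        }

        return reply;
    }
}

// Client side: a cache whose affinity is missing from the FullMessage is
// closed for user interactions until the destroy event arrives.
class ClientCacheProxy {
    private boolean closed;

    void onFullMessage(Map<Integer, String> affinityReply, int cacheId) {
        if (!affinityReply.containsKey(cacheId))
            closed = true; // further operations fail with a "cache stopped" error
    }

    boolean isClosed() { return closed; }
}
```

The key design point is that the missing affinity is treated as a benign,
transient condition on both sides: the coordinator omits rather than asserts,
and the client fails fast on that one cache while the rest of the PME
completes normally.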
--
This message was sent by Atlassian Jira
(v8.3.4#803005)