Mikhail Petrov created IGNITE-27109:
---------------------------------------
Summary: IgniteCache#putAll may silently lose entries while any
node is leaving the cluster
Key: IGNITE-27109
URL: https://issues.apache.org/jira/browse/IGNITE-27109
Project: Ignite
Issue Type: Bug
Reporter: Mikhail Petrov
Assignee: Mikhail Petrov
An IgniteCache#putAll call may succeed even though some of the specified
entries are not stored in the cache. This can happen for ATOMIC caches when a
node leaves the cluster during IgniteCache#putAll execution. Even though putAll
can partially fail for atomic caches, the user should still get a
CachePartialUpdateException.
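A caller who wants to detect this situation could read the entries back after
putAll returns. The following is only a minimal sketch: PutAllVerifier and
findLostKeys are hypothetical names that are not part of Ignite, and the check
assumes no other writer touches the same keys in between. With a real cache the
bulkReader argument could be a reference to IgniteCache#getAll.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class PutAllVerifier {
    private PutAllVerifier() {
    }

    /**
     * Returns the keys whose stored value differs from the value that was written.
     *
     * @param written    entries that were passed to putAll
     * @param bulkReader reads the given keys back, e.g. cache::getAll (assumption)
     */
    static <K, V> Set<K> findLostKeys(Map<K, V> written, Function<Set<K>, Map<K, V>> bulkReader) {
        Map<K, V> stored = bulkReader.apply(written.keySet());
        Set<K> lost = new LinkedHashSet<>();

        for (Map.Entry<K, V> e : written.entrySet()) {
            // A missing key yields null from stored.get(), which never equals a written value.
            if (!Objects.equals(e.getValue(), stored.get(e.getKey())))
                lost.add(e.getKey());
        }

        return lost;
    }

    public static void main(String[] args) {
        Map<Integer, String> written = new HashMap<>();
        written.put(1, "a");
        written.put(2, "b");

        // In-memory stand-in for a cache that silently dropped key 2.
        Map<Integer, String> store = new HashMap<>();
        store.put(1, "a");

        Set<Integer> lost = findLostKeys(written, keys -> {
            Map<Integer, String> res = new HashMap<>();
            for (Integer k : keys) {
                if (store.containsKey(k))
                    res.put(k, store.get(k));
            }
            return res;
        });

        System.out.println("Lost keys: " + lost); // prints "Lost keys: [2]"
    }
}
```

A non-empty result after a putAll that returned normally would correspond to the
silent loss described above.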
The problem is reproduced by the ReliabilityTest.testFailover test. Cache
configuration: ATOMIC, REPLICATED, FULL_SYNC
See:
https://ci2.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8360567487297938069&tab=testDetails&branch_IgniteTests24Java8=%3Cdefault%3E
Explanation:
Consider a cluster with 3 nodes: node0, node1, node2.
1. node0 accepts the putAll request, maps all keys to their primary nodes
and sends GridNearAtomicFullUpdateRequest to node1 and node2.
2. node1 starts processing cache entries. Halfway through this process node1
receives a stop signal (Ignite#close). All remaining attempts to process cache
entries fail with an exception - see
IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#invoke and
IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#operationCancelledException.
3. node1 manages to send GridDhtAtomicUpdateRequest with all processed entries
to the backups - node2 and node0.
4. node1 fails to send GridNearAtomicUpdateResponse with the failed keys to
node0 because NIO was stopped. This message is the indication to the "near"
node that some keys could not be processed and that the operation should be
terminated with an exception.
5. node0 and node2 process the entries from the GridDhtAtomicUpdateRequest
messages and send GridDhtAtomicNearResponse messages to node0.
6. node1 is removed from the cluster.
7. node0 gets an event that node1 (the primary node for some keys) has left the
cluster, but it has already received GridDhtAtomicNearResponse messages from
all backups. So node0 does nothing and eventually completes the putAll
operation, and the failed keys are never reported to the caller.
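
The sequence above can be condensed into a toy model. This is a sketch under
stated assumptions, not Ignite code: PutAllFailoverModel, Result and run are
hypothetical names, only the keys mapped to node1 are modelled, and the backup
acknowledgements from step 5 are assumed to always arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of steps 1-7; not Ignite code, all names are hypothetical. */
final class PutAllFailoverModel {
    enum Outcome { SUCCESS, PARTIAL_FAILURE }

    static final class Result {
        final Outcome outcome;                // what the putAll caller on node0 observes
        final List<Integer> storedKeys;       // keys node1 processed and sent to the backups
        final List<Integer> silentlyLostKeys; // failed keys the caller never hears about

        Result(Outcome outcome, List<Integer> storedKeys, List<Integer> silentlyLostKeys) {
            this.outcome = outcome;
            this.storedKeys = storedKeys;
            this.silentlyLostKeys = silentlyLostKeys;
        }
    }

    /**
     * @param keysOnNode1              keys whose primary is node1 (step 1)
     * @param processedBeforeStop      how many of them node1 handles before the stop signal (step 2)
     * @param failureResponseDelivered whether the response with failed keys reaches node0 (step 4)
     */
    static Result run(List<Integer> keysOnNode1, int processedBeforeStop, boolean failureResponseDelivered) {
        // Step 2: entries handled before the stop signal succeed, the rest fail.
        List<Integer> processed = new ArrayList<>(keysOnNode1.subList(0, processedBeforeStop));
        List<Integer> failed = new ArrayList<>(keysOnNode1.subList(processedBeforeStop, keysOnNode1.size()));

        // Steps 3 and 5: the processed entries reach the backups, which acknowledge
        // them to node0. In this model that part always succeeds.

        // Step 4: node0 learns about the failed keys only if the response gets through.
        boolean node0KnowsFailedKeys = failureResponseDelivered && !failed.isEmpty();

        // Steps 6-7: node1 leaves; with all backup acknowledgements in hand and no
        // failure report, node0 completes the operation normally.
        Outcome outcome = node0KnowsFailedKeys ? Outcome.PARTIAL_FAILURE : Outcome.SUCCESS;
        List<Integer> silentlyLost = outcome == Outcome.SUCCESS ? failed : Collections.<Integer>emptyList();

        return new Result(outcome, processed, silentlyLost);
    }

    public static void main(String[] args) {
        Result r = run(Arrays.asList(1, 2, 3, 4), 2, false);
        // prints "SUCCESS stored=[1, 2] lost=[3, 4]"
        System.out.println(r.outcome + " stored=" + r.storedKeys + " lost=" + r.silentlyLostKeys);
    }
}
```

With failureResponseDelivered set to true the same input yields PARTIAL_FAILURE
and no silently lost keys, which is the behaviour the user should see.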
--
This message was sent by Atlassian Jira
(v8.20.10#820010)