[
https://issues.apache.org/jira/browse/IGNITE-28381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Petrov updated IGNITE-28381:
------------------------------------
Summary: Node failure with Assertion Error during Atomic Cache operation
topology remapping (was: Assertion Error during Atomic Cache operation
topology remapping)
> Node failure with Assertion Error during Atomic Cache operation topology
> remapping
> ----------------------------------------------------------------------------------
>
> Key: IGNITE-28381
> URL: https://issues.apache.org/jira/browse/IGNITE-28381
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Fix For: 2.19
>
> Attachments: _Apache_Ignite_2.x_Tests_Thin_Client_Java_16197.log.zip
>
>
> The problem was reproduced during
> org.apache.ignite.client.ReliabilityTest#testFailover test on TC. See
> attachments for the full log.
> Exception:
> {code:java}
> java.lang.AssertionError
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.onPrimaryResponse(GridNearAtomicUpdateFuture.java:400)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateResponse(GridDhtAtomicCache.java:3236)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:311)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:306)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1084)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:594)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:396)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1862)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1483)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1387)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.thread.pool.IgniteStripedExecutor$Stripe.body(IgniteStripedExecutor.java:638)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> java.base/java.lang.Thread.run(Thread.java:829)
> {code}
> The root cause of the assertion error:
> Consider a cluster consisting of 3 nodes - node0, node1, node2
> 1. node2 is gracefully stopped
> 2. PME is started on node0 and node 1
> 3. PME is completed on node 0. So its topology version is now V+1.
> 4. Ignite#close method is called to gracefully stop the node 1
> 5. PME future on node1 is cancelled with the following exception
> {code:java}
> org.apache.ignite.internal.NodeStoppingException: Node is stopping:
> e2c0e324-70b6-45b8-925d-fe9a9fa5d877
> at
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.onKernalStop0(GridCachePartitionExchangeManager.java:842)
> at
> org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.onKernalStop(GridCacheSharedManagerAdapter.java:131)
> at
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:819)
> at org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:1782)
> at org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:1731)
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2297)
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2117)
> at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:308)
> at org.apache.ignite.Ignition.stop(Ignition.java:202)
> at org.apache.ignite.internal.IgniteKernal.close(IgniteKernal.java:2686)
> at
> org.apache.ignite.client.LocalIgniteCluster.failNode(LocalIgniteCluster.java:110)
> at
> org.apache.ignite.client.ReliabilityTest.lambda$assertOnUnstableCluster$17(ReliabilityTest.java:714)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> at
> java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
> {code}
> 6. GridCachePartitionExchangeManager#onExchangeDone and
> GridDhtPartitionTopology#onExchangeDone are not called on node 1 because the
> PME future was completed with the exception.
> 7.Even when stopped, node 1 can receive and process requests from other nodes
> for a short period of time. Node 1 (primary) receives
> GridDhtAtomicAbstractUpdateRequest from node 0 (near). Which was mapped by
> node 0 to V+1 topology version.
> 8. PME future was completed on node 1 so nothing prevents it from handling
> the request (see GridMessageListener#onMessage)
> 9. The cache request is handled by GridDhtAtomicCache#updateAllAsyncInternal0
> 10. GridDhtCacheAdapter#needRemap check returns true because cache request
> was mapped on V+1 topology version but node1 local version is V (see
> `cctx.topology().readyTopologyVersion()` call in
> GridCacheAffinityManager#isCompatibleWithCurrentTopologyVersion method and
> clause 6).
> 11. node1 sends response to node0 with remap topology version == V+1. See
> `res.remapTopologyVersion(top.lastTopologyChangeVersion());` call.
> `top.lastTopologyChangeVersion()` returns V+1 because even PME future failed,
> it was started and related to V+1 topology version.
> 12. node0 receives node1 response and fails with the mentioned above stack
> trace
> Actually it may differ whether single cache put is called or putAll.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)