[
https://issues.apache.org/jira/browse/IGNITE-28771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088556#comment-18088556
]
Aleksey Plekhanov commented on IGNITE-28771:
--------------------------------------------
If the grid contains, for example, two nodes, the scenario can be like this:
* First node stops and leaves the grid.
* PME starts on the second node.
* The stop command is invoked on the second node.
* When the node enters the
GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator method (on the
"exchange-worker" thread), it acquires busyLock, assuming that node shutdown
will wait until the method completes execution.
* However, within this method, an asynchronous call to doInParallel occurs,
submitting tasks to the system executor.
* In parallel, the node's stop procedure reaches
GridCachePartitionExchangeManager.onKernelStop0(), interrupting the
"exchange-worker".
* The exchange worker awaits completion of futures created by doInParallel.
Upon interruption, the busyLock is released, allowing further progress in node
shutdown beyond GridCachePartitionExchangeManager.stop0.
* During node shutdown, the process proceeds through
IgniteCacheDatabaseSharedManager.stop0() where off-heap memory is deallocated.
* Meanwhile, the tasks started by the "exchange-worker" in the system executor
attempt to create new partitions and allocate pages from page memory. This
leads to accessing already-deallocated memory regions, causing a JVM crash.
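
The interleaving above can be replayed in a small standalone model. This is only a sketch and not Ignite code: a ReentrantReadWriteLock stands in for busyLock (read lock = "busy", write lock = "stopping"), an AtomicBoolean stands in for off-heap memory, and the class name, latches and single-thread executor are all hypothetical helpers used to force the ordering. With that ordering forced, the model returns true, i.e. the task looks at "memory" only after it has been released.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the race; none of these names come from Ignite. */
class PmeStopRaceSketch {
    /** Replays the interleaving; returns true if the task saw already-freed "memory". */
    static boolean runScenario() {
        // Assumption: read lock models "busy", write lock models "node is stopping".
        ReentrantReadWriteLock busyLock = new ReentrantReadWriteLock();
        AtomicBoolean memoryFreed = new AtomicBoolean();
        ExecutorService sysExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch taskStarted = new CountDownLatch(1);
        CountDownLatch stopFinished = new CountDownLatch(1);
        CompletableFuture<Boolean> sawFreedMemory = new CompletableFuture<>();

        Thread exchangeWorker = new Thread(() -> {
            busyLock.readLock().lock();
            try {
                // The task runs on another thread and does NOT hold the busy lock.
                Future<?> fut = sysExecutor.submit(() -> {
                    taskStarted.countDown();
                    try {
                        // Latch forces the bad ordering: run only after stop is done.
                        stopFinished.await();
                        sawFreedMemory.complete(memoryFreed.get());
                    } catch (InterruptedException e) {
                        sawFreedMemory.completeExceptionally(e);
                    }
                });
                fut.get(); // the worker is interrupted while waiting here
            } catch (InterruptedException | ExecutionException e) {
                // Interrupted by the stop procedure: fall through and release the lock.
            } finally {
                busyLock.readLock().unlock();
            }
        }, "exchange-worker");

        try {
            exchangeWorker.start();
            taskStarted.await();
            exchangeWorker.interrupt();      // models interrupting the worker on stop
            busyLock.writeLock().lock();     // stop waits only for the worker's lock
            try {
                memoryFreed.set(true);       // models off-heap deallocation
            } finally {
                busyLock.writeLock().unlock();
            }
            stopFinished.countDown();
            return sawFreedMemory.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            sysExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("task touched freed memory: " + runScenario());
    }
}
```

In the real crash the ordering is not forced by latches, which matches the fact that the failure is intermittent and agent-dependent.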
The issue has been reproduced using various tests across multiple TC agents,
although the likelihood of reproducing it is highest on the lin-02 agent.
Also, AuthorizationIntegrationTest is one of the tests with a relatively high
rate of reproduction.
I've set up a test suite with 500 runs of the AuthorizationIntegrationTest, and
this suite typically crashes the JVM within 1 minute of execution.
However, when the busyLock is acquired in threads spawned by the doInParallel
method, the suite successfully completes after running for over twenty minutes.
See
[https://ci2.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_CalciteSql3&branch_IgniteTests24Java8=pull%2F13237%2Fhead&tab=buildTypeStatusDiv]
!image-2026-06-12-18-41-23-190.png|width=980,height=443!
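
The idea of acquiring the busyLock inside the threads spawned by doInParallel can be sketched with the same kind of standalone model. This is an illustration under assumptions, not the actual patch: a ReentrantReadWriteLock stands in for busyLock, the write lock is taken on stop and never released (assumed "blocked for good" semantics), and guarded(...) and the latches are hypothetical helpers. A task that is already running delays the stop until it finishes, and a task that starts after the stop skips its work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of taking the busy lock inside each parallel task. */
class BusyLockInTaskSketch {
    /** Wraps "memory access" so that it happens only while the busy (read) lock is held. */
    static Callable<Boolean> guarded(ReentrantReadWriteLock busyLock, AtomicBoolean memoryFreed,
        CountDownLatch entered, CountDownLatch proceed) {
        return () -> {
            // Node is stopping: skip the work instead of touching memory.
            if (!busyLock.readLock().tryLock())
                return false;

            try {
                entered.countDown();
                proceed.await();
                return memoryFreed.get(); // true would mean use-after-free
            } finally {
                busyLock.readLock().unlock();
            }
        };
    }

    /** Returns true if any task observed freed "memory". */
    static boolean runScenario() {
        ReentrantReadWriteLock busyLock = new ReentrantReadWriteLock();
        AtomicBoolean memoryFreed = new AtomicBoolean();
        ExecutorService sysExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch proceed = new CountDownLatch(1);

        try {
            // Task that starts before the stop and holds the busy lock.
            Future<Boolean> early = sysExecutor.submit(guarded(busyLock, memoryFreed, entered, proceed));
            entered.await();

            // Stop procedure: blocks on the write lock until the early task is done.
            proceed.countDown();
            busyLock.writeLock().lock(); // intentionally never released in this model
            memoryFreed.set(true);

            // Task that starts after the stop: tryLock fails, so it skips the access.
            Future<Boolean> late = sysExecutor.submit(
                guarded(busyLock, memoryFreed, new CountDownLatch(1), new CountDownLatch(0)));

            return early.get() || late.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            sysExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("any task touched freed memory: " + runScenario());
    }
}
```

In this model neither task can observe the freed state, because the deallocation step can only run while no task holds the read lock.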
> Investigate JVM crash on TC
> ---------------------------
>
> Key: IGNITE-28771
> URL: https://issues.apache.org/jira/browse/IGNITE-28771
> Project: Ignite
> Issue Type: Bug
> Reporter: Aleksey Plekhanov
> Assignee: Aleksey Plekhanov
> Priority: Major
> Labels: ise
> Attachments: image-2026-06-12-18-41-23-190.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> JVM crashes sometimes occur on TC in the calcite/cache suites. In most cases
> the problematic frame is shown as:
> {noformat}
> # Problematic frame:
> # v ~StubRoutines::jlong_disjoint_arraycopy
> {noformat}
> Sometimes crash dumps also refer to the PageMemoryNoStoreImpl::allocatePage
> method.
> Once there was an extended stack trace in the crash dump:
> {noformat}
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
> J 136663 c1
> org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl$Segment.allocateFreePage(I)J
> (197 bytes) @ 0x00007fa21564b789 [0x00007fa21564b160+0x0000000000000629]
> J 136660 c1
> org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.allocatePage(IIB)J
> (337 bytes) @ 0x00007fa219aa69c4 [0x00007fa219aa6820+0x00000000000001a4]
> J 75880 c2
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.createCacheDataStore0(I)Lorg/apache/ignite/internal/processors/cache/IgniteCacheOffheapManager$CacheDataStore;
> (114 bytes) @ 0x00007fa2218a4af0 [0x00007fa2218a1de0+0x0000000000002d10]
> J 75217 c2
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.<init>(Lorg/apache/ignite/internal/processors/cache/GridCacheSharedContext;Lorg/apache/ignite/internal/processors/cache/CacheGroupContext;IZ)V
> (423 bytes) @ 0x00007fa2217b2fe0 [0x00007fa2217b1520+0x0000000000001ac0]
> J 75237 c2
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.getOrCreatePartition(I)Lorg/apache/ignite/internal/processors/cache/distributed/dht/topology/GridDhtLocalPartition;
> (166 bytes) @ 0x00007fa22181c748 [0x00007fa22181c480+0x00000000000002c8]
> J 118104 c2
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.createPartitions(Lorg/apache/ignite/internal/processors/affinity/AffinityTopologyVersion;Ljava/util/List;J)V
> (229 bytes) @ 0x00007fa21de50250 [0x00007fa21de4fb20+0x0000000000000730]
> J 119238 c2
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.initPartitions(Lorg/apache/ignite/internal/processors/affinity/AffinityTopologyVersion;Ljava/util/List;Lorg/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionsExchangeFuture;J)Z
> (629 bytes) @ 0x00007fa21f9e2114 [0x00007fa21f9e1ee0+0x0000000000000234]
> J 134697 c1
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.beforeExchange(Lorg/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionsExchangeFuture;ZZ)V
> (1096 bytes) @ 0x00007fa21a79437c [0x00007fa21a792cc0+0x00000000000016bc]
> J 137140 c1
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$$Lambda$1711.apply(Ljava/lang/Object;)Ljava/lang/Object;
> (12 bytes) @ 0x00007fa219ab7cac [0x00007fa219ab7be0+0x00000000000000cc]
> {noformat}
> Looks like there can be a race between PME and node stop:
--
This message was sent by Atlassian Jira
(v8.20.10#820010)