[
https://issues.apache.org/jira/browse/IGNITE-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Abashev updated IGNITE-28255:
----------------------------------
Description: (was: Summary:
MarshallerCacheJobRunNodeRestartTest.testJobRun fails intermittently with
timeout
Description:
The test MarshallerCacheJobRunNodeRestartTest.testJobRun hangs and times out
after 5 minutes (300 000 ms).
TC link:
https://ci2.ignite.apache.org/test/381112157178694638?currentProjectId=IgniteTests24Java8&branch=%3Cdefault%3E
Failure rate: 2 failures out of 68 runs (~3%), both on aitc-lin15, branch
refs/heads/master (builds #41053, #41049)
Root cause (from thread dump):
The test runner thread
test-runner-#83435%cache.MarshallerCacheJobRunNodeRestartTest% is stuck in
WAITING state inside GridTestUtils.runMultiThreaded() at Thread.join(), waiting
for worker threads that never finish:
Thread [name="test-runner-#83435%cache.MarshallerCacheJobRunNodeRestartTest%", state=WAITING]
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1304)
    at o.a.i.testframework.GridTestUtils.runMultiThreaded(GridTestUtils.java:1124)
    at o.a.i.i.processors.cache.MarshallerCacheJobRunNodeRestartTest.testJobRun(MarshallerCacheJobRunNodeRestartTest.java:65)
The main thread holds multiple ReentrantReadWriteLock instances (13 locked
synchronizers visible in the dump).
Additionally, a suspicious warning appears in the log just before the hang:
Joining node doesn't have stored group keys [node=03e08542-cd7b-4a95-a9fe-bae553f00004]
This suggests a worker thread may be stuck waiting for group key exchange to
complete during node restart. If that exchange never finishes, the worker never
exits and the runMultiThreaded call hangs indefinitely.
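To illustrate the hang pattern described above, here is a minimal, standalone
sketch of joining worker threads against a deadline instead of with an
unbounded Thread.join(). It is not the GridTestUtils implementation; the class
BoundedJoinSketch, the methods joinAll and demo, and the worker names are
hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not part of Ignite or GridTestUtils.
class BoundedJoinSketch {
    /** Joins workers against one shared deadline and returns those still alive. */
    public static List<Thread> joinAll(List<Thread> workers, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        List<Thread> stuck = new ArrayList<>();

        for (Thread t : workers) {
            // join(0) means "wait forever", so never pass less than 1 ms.
            long leftMs = Math.max(1L, (deadline - System.nanoTime()) / 1_000_000L);

            try {
                t.join(leftMs);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
            }

            if (t.isAlive())
                stuck.add(t);
        }

        return stuck;
    }

    /** Demo: one worker finishes, the other blocks on a latch until released. */
    public static List<String> demo() {
        CountDownLatch never = new CountDownLatch(1);

        Thread ok = new Thread(() -> { }, "worker-ok");
        Thread hung = new Thread(() -> {
            try {
                never.await(); // Stand-in for a wait that never completes.
            }
            catch (InterruptedException ignored) {
                // Exit quietly.
            }
        }, "worker-hung");

        hung.setDaemon(true); // Do not keep the JVM alive.
        ok.start();
        hung.start();

        List<String> names = new ArrayList<>();

        for (Thread t : joinAll(Arrays.asList(ok, hung), 300))
            names.add(t.getName());

        never.countDown(); // Release the blocked worker.

        return names;
    }

    public static void main(String[] args) {
        System.out.println("Still running after timeout: " + demo());
    }
}
```

With a bounded join, the caller gets back the threads that did not finish and
can print their names and stack traces instead of waiting for the 5 minute test
timeout.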
Environment:
- Ignite version: 2.18.0-SNAPSHOT#20260317
- JVM: OpenJDK 17.0.8.1+1 Eclipse Adoptium
- OS: Linux 5.4.0-216-generic amd64
- Agent: aitc-lin15
Steps to investigate:
1. Check why the restarted node doesn't have stored group keys: is the key
exchange protocol completing correctly during
MarshallerCacheJobRunNodeRestartTest?
2. Identify which worker thread inside runMultiThreaded is blocked and why it
never returns
3. Check for a race condition between node restart and group key propagation in
the marshaller cache)
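For step 2 above, one possible way to find the blocked worker from inside the
JVM is the standard java.lang.management.ThreadMXBean API, which reports each
thread's state, the lock it is waiting on, that lock's owner, and the ownable
synchronizers it holds. The sketch below is an assumption about how such a
check could look, not existing Ignite test code; the class BlockedWorkerReport,
its methods, and the thread name "sketch-worker-0" are hypothetical.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not part of Ignite.
class BlockedWorkerReport {
    /** Returns one line per non-RUNNABLE thread whose name contains nameFragment. */
    public static List<String> report(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // Request lock details only if this JVM supports them.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean syncs = mx.isSynchronizerUsageSupported();

        List<String> lines = new ArrayList<>();

        for (ThreadInfo ti : mx.dumpAllThreads(monitors, syncs)) {
            if (ti == null || !ti.getThreadName().contains(nameFragment))
                continue;

            if (ti.getThreadState() == Thread.State.RUNNABLE)
                continue;

            LockInfo[] held = ti.getLockedSynchronizers();

            lines.add(ti.getThreadName()
                + " state=" + ti.getThreadState()
                + " waitingOn=" + ti.getLockName()
                + " owner=" + ti.getLockOwnerName()
                + " heldSynchronizers=" + held.length);
        }

        return lines;
    }

    /** Demo: the caller holds a write lock while a worker waits for the read lock. */
    public static List<String> demo() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        lock.writeLock().lock();

        Thread worker = new Thread(() -> {
            lock.readLock().lock(); // Parks until the write lock is released.
            lock.readLock().unlock();
        }, "sketch-worker-0");

        worker.setDaemon(true);

        try {
            worker.start();

            // Poll (bounded to about 2 s) until the worker is parked on the lock.
            for (int i = 0; i < 200 && worker.getState() != Thread.State.WAITING; i++)
                Thread.sleep(10);

            return report("sketch-worker");
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return new ArrayList<>();
        }
        finally {
            lock.writeLock().unlock(); // Let the worker finish.
        }
    }

    public static void main(String[] args) {
        for (String line : demo())
            System.out.println(line);
    }
}
```

In the failing test, a similar report filtered by the worker thread name prefix
could be printed when a bounded wait expires, which would show which lock the
worker is parked on and which thread owns it.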
> Fix java.io.NotSerializableException:
> org.apache.ignite.internal.processors.marshaller.MarshallerMappingItem
> ------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-28255
> URL: https://issues.apache.org/jira/browse/IGNITE-28255
> Project: Ignite
> Issue Type: Bug
> Reporter: Alex Abashev
> Assignee: Alex Abashev
> Priority: Major
> Labels: IEP-132
> Fix For: 2.19
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>