[ https://issues.apache.org/jira/browse/FLINK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013178#comment-18013178 ]
Chesnay Schepler edited comment on FLINK-38223 at 8/12/25 12:13 PM:
--------------------------------------------------------------------

Running this locally, you quickly run into main thread violation errors.
{code}
1752 [pool-102-thread-1] WARN org.apache.flink.runtime.rpc.MainThreadValidatorUtil [] - Violation of main thread constraint detected: expected <Thread[#138,ForkJoinPool-51-worker-1,5,main]> but running in <Thread[#139,pool-102-thread-1,5,main]>.
java.lang.Exception: Violation of main thread constraint detected: expected <Thread[#138,ForkJoinPool-51-worker-1,5,main]> but running in <Thread[#139,pool-102-thread-1,5,main]>.
{code}
These happen during the deployment of executions, causing them to fail.

This was probably a pre-existing defect, and FLINK-38114 possibly started triggering these as it added more switching between the main thread and io executor.

In practice this means all these tests need to be rewritten to not call directly into the scheduler/EG, but create a proper main thread executor. Specifically, all tests that use {{ComponentMainThreadExecutorServiceAdapter#forMainThread}} from the test thread and call into the execution graph / scheduler components are susceptible.

> ExecutionGraphRestartTest and ExecutionGraphCoLocationRestartTest are flaky on master
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-38223
>                 URL: https://issues.apache.org/jira/browse/FLINK-38223
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.1
>            Reporter: Gustavo de Morais
>            Priority: Major
>             Fix For: 2.2
>
>
> Both these suites are really flaky on master. Tests like testConstraintsAfterRestart and testCancelWhileFailing are constantly failing CI pipelines with errors like the ones below.
> You can reproduce it by running the suites locally.
> {code:java}
> Aug 11 00:04:37 00:04:37.047 [ERROR] Errors:
> Aug 11 00:04:37 00:04:37.047 [ERROR]   ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart:113 » Timeout Not all executions fulfilled the predicate in time.
> {code}
> {code:java}
> org.opentest4j.AssertionFailedError: expected: RUNNING but was: FAILING
> Expected :RUNNING
> Actual   :FAILING
>     at org.apache.flink.runtime.executiongraph.ExecutionGraphRestartTest.testCancelWhileFailing(ExecutionGraphRestartTest.java:217)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
>     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
>     Suppressed: java.lang.IllegalStateException: Free slot must not be used.
>         at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
>         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:564)
>         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeAndReleaseSlots(DefaultDeclarativeSlotPool.java:507)
>         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:477)
>         at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.internalReleaseTaskManager(DeclarativeSlotPoolService.java:281)
>         at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.releaseAllTaskManagers(DeclarativeSlotPoolService.java:271)
>         at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.close(DeclarativeSlotPoolService.java:160)
>         at org.apache.flink.runtime.executiongraph.ExecutionGraphRestartTest.testCancelWhileFailing(ExecutionGraphRestartTest.java:200)
>         ... 7 more
> {code}
> {code:java}
> java.util.concurrent.TimeoutException: Not all executions fulfilled the predicate in time.
>     at org.apache.flink.runtime.executiongraph.ExecutionGraphTestUtils.waitForAllExecutionsPredicate(ExecutionGraphTestUtils.java:203)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart(ExecutionGraphCoLocationRestartTest.java:113)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
>     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> CI Link example
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69283&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=1ffc5ec2-7913-50ff-0177-3fca16f1b8f0]
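
For readers unfamiliar with the main thread constraint mentioned in the comment above, here is a minimal plain-Java sketch of the idea. It does not use any Flink classes: {{GuardedComponent}}, {{directCallViolates}} and {{deployViaMainThread}} are hypothetical names made up for illustration, and the single-threaded {{ExecutorService}} only stands in for what a "proper main thread executor" might look like in a test. The actual fix in Flink may well look different.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MainThreadConstraintSketch {

    /** Hypothetical stand-in for a component that may only be used from one thread. */
    static final class GuardedComponent {
        private volatile Thread mainThread;
        private int deployments;

        /** Remembers the calling thread as the only thread allowed to call deploy(). */
        void bindToCurrentThread() {
            mainThread = Thread.currentThread();
        }

        /** Fails if invoked from any thread other than the bound one. */
        void deploy() {
            Thread expected = mainThread;
            Thread actual = Thread.currentThread();
            if (expected != actual) {
                throw new IllegalStateException(
                        "Violation of main thread constraint detected: expected <"
                                + expected + "> but running in <" + actual + ">.");
            }
            deployments++;
        }

        int deployments() {
            return deployments;
        }
    }

    /** Binds the component to the executor thread, then calls it from the caller's thread. */
    public static boolean directCallViolates() {
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();
        try {
            GuardedComponent component = new GuardedComponent();
            CompletableFuture.runAsync(component::bindToCurrentThread, mainThreadExecutor).join();
            try {
                component.deploy(); // wrong thread: this is the pattern the comment warns about
                return false;
            } catch (IllegalStateException e) {
                return true;
            }
        } finally {
            mainThreadExecutor.shutdownNow();
        }
    }

    /** Routes every interaction through the single-threaded executor instead. */
    public static int deployViaMainThread(int times) {
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();
        try {
            GuardedComponent component = new GuardedComponent();
            CompletableFuture.runAsync(component::bindToCurrentThread, mainThreadExecutor).join();
            for (int i = 0; i < times; i++) {
                CompletableFuture.runAsync(component::deploy, mainThreadExecutor).join();
            }
            return CompletableFuture.supplyAsync(component::deployments, mainThreadExecutor).join();
        } finally {
            mainThreadExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("direct call violates: " + directCallViolates());
        System.out.println("deployments via main thread: " + deployViaMainThread(3));
    }
}
```

The point of the sketch is only that a test which calls into such a component directly from the test thread can trip the check, whereas a test that submits every call to one dedicated executor thread cannot.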