[ 
https://issues.apache.org/jira/browse/FLINK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013178#comment-18013178
 ] 

Chesnay Schepler edited comment on FLINK-38223 at 8/12/25 12:13 PM:
--------------------------------------------------------------------

Running this locally, you quickly run into main-thread violation errors.

{code}
1752 [pool-102-thread-1] WARN  org.apache.flink.runtime.rpc.MainThreadValidatorUtil [] - Violation of main thread constraint detected: expected <Thread[#138,ForkJoinPool-51-worker-1,5,main]> but running in <Thread[#139,pool-102-thread-1,5,main]>.
java.lang.Exception: Violation of main thread constraint detected: expected <Thread[#138,ForkJoinPool-51-worker-1,5,main]> but running in <Thread[#139,pool-102-thread-1,5,main]>.
{code}

These happen during the deployment of executions, causing them to fail.

This was probably a pre-existing defect, and FLINK-38114 possibly started 
triggering these violations as it added more switching between the main 
thread and the I/O executor.

In practice this means all these tests need to be rewritten so that they do 
not call directly into the scheduler/execution graph (EG), but instead create 
a proper main thread executor.
Specifically, all tests that use 
{{ComponentMainThreadExecutorServiceAdapter#forMainThread}} from the test 
thread and call into the execution graph / scheduler components are susceptible.
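
To illustrate the failure mode, here is a minimal JDK-only sketch. It does not use Flink classes: {{GuardedComponent}} and {{deployAfterIo}} are hypothetical names, and the thread check only approximates what {{MainThreadValidatorUtil}} reports. The assumption is that a continuation of I/O work ends up calling a main-thread-only component; the check passes only if that continuation is handed back to a dedicated single-threaded executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MainThreadConstraintSketch {

    /** Hypothetical stand-in for a scheduler/EG component bound to one thread. */
    static final class GuardedComponent {
        private final Thread expected;

        GuardedComponent(Thread expected) {
            this.expected = expected;
        }

        void deploy() {
            Thread current = Thread.currentThread();
            if (current != expected) {
                throw new IllegalStateException(
                        "Violation of main thread constraint detected: expected <"
                                + expected + "> but running in <" + current + ">.");
            }
        }
    }

    /**
     * Runs deploy() as a continuation of I/O work.
     * Returns true if the main thread constraint held, false if it was violated.
     */
    static boolean deployAfterIo(boolean hopBackToMainThread) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            // Capture the single worker thread backing the "main thread" executor.
            Thread main = CompletableFuture.supplyAsync(Thread::currentThread, mainThread).join();
            GuardedComponent component = new GuardedComponent(main);

            CompletableFuture<Void> ioWork = CompletableFuture.runAsync(() -> { }, io);
            CompletableFuture<Void> result = hopBackToMainThread
                    // Continuation is handed back to the main thread executor.
                    ? ioWork.thenRunAsync(component::deploy, mainThread)
                    // Continuation stays on the I/O thread and trips the check.
                    : ioWork.thenRunAsync(component::deploy, io);
            try {
                result.join();
                return true;
            } catch (CompletionException e) {
                return false;
            }
        } finally {
            mainThread.shutdownNow();
            io.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("hop back to main thread: " + deployAfterIo(true));
        System.out.println("stay on I/O thread:      " + deployAfterIo(false));
    }
}
```

In a Flink test the equivalent fix would be to run every call into the scheduler/EG on the executor that owns the main thread, rather than binding the main thread to whichever thread happens to run the test.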


> ExecutionGraphRestartTest and ExecutionGraphCoLocationRestartTest are flaky 
> on master
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-38223
>                 URL: https://issues.apache.org/jira/browse/FLINK-38223
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.1
>            Reporter: Gustavo de Morais
>            Priority: Major
>             Fix For: 2.2
>
>
> Both of these suites are very flaky on master. Tests like 
> testConstraintsAfterRestart and testCancelWhileFailing are constantly failing 
> CI pipelines with errors like the ones below.
> You can reproduce this by running the suites locally.
> {code:java}
> Aug 11 00:04:37 00:04:37.047 [ERROR] Errors: 
> Aug 11 00:04:37 00:04:37.047 [ERROR]   ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart:113 » Timeout Not all executions fulfilled the predicate in time.
> {code}
> {code:java}
> org.opentest4j.AssertionFailedError: expected: RUNNING but was: FAILING
> Expected :RUNNING
> Actual   :FAILING
>       at org.apache.flink.runtime.executiongraph.ExecutionGraphRestartTest.testCancelWhileFailing(ExecutionGraphRestartTest.java:217)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>       at java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
>       at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
>       at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>       at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>       at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>       at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
>       Suppressed: java.lang.IllegalStateException: Free slot must not be used.
>               at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
>               at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:564)
>               at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeAndReleaseSlots(DefaultDeclarativeSlotPool.java:507)
>               at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:477)
>               at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.internalReleaseTaskManager(DeclarativeSlotPoolService.java:281)
>               at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.releaseAllTaskManagers(DeclarativeSlotPoolService.java:271)
>               at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.close(DeclarativeSlotPoolService.java:160)
>               at org.apache.flink.runtime.executiongraph.ExecutionGraphRestartTest.testCancelWhileFailing(ExecutionGraphRestartTest.java:200)
>               ... 7 more
>  {code}
> {code:java}
> java.util.concurrent.TimeoutException: Not all executions fulfilled the predicate in time.
>       at org.apache.flink.runtime.executiongraph.ExecutionGraphTestUtils.waitForAllExecutionsPredicate(ExecutionGraphTestUtils.java:203)
>       at org.apache.flink.runtime.executiongraph.ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart(ExecutionGraphCoLocationRestartTest.java:113)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>       at java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
>       at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
>       at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>       at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>       at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>       at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
>  {code}
> CI Link example
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69283&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=1ffc5ec2-7913-50ff-0177-3fca16f1b8f0]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
