[ 
https://issues.apache.org/jira/browse/FLINK-12926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878302#comment-16878302
 ] 

Zhu Zhu edited comment on FLINK-12926 at 7/4/19 4:12 AM:
---------------------------------------------------------

Hi [~till.rohrmann], from my observation, the issue does occur, although it 
does not break any current tests.

Below are some cases where it happens or may happen:

1. Even though the tests themselves do not trigger actions from other threads, 
the production logic might, e.g. *Execution#deploy()* as shown in the attached 
picture. This happens in most of the tests mentioned above. It does not break 
the tests because it is not on the critical path and the failed main thread 
check does not cause a failover. 

!Execution#deploy.jpg|width=568,height=328!

2. The *TestingComponentMainThreadExecutorServiceAdapter* uses 
*DirectScheduledExecutorService* as the underlying ScheduledExecutorService. 
However, DirectScheduledExecutorService runs scheduled tasks from another 
thread. So if any *mainThreadExecutor.schedule** action is invoked in tests or 
in the production code path, it may also violate the main thread check. No 
test breaks because of this yet, but I think we have just been lucky so far 
(or was this intentional?). For example:

    -  FixedDelayRestartStrategy. No test breaks because no test uses 
FixedDelayRestartStrategy to do a failover yet.

    -  HeartbeatMonitor. No test breaks because it does not check the main 
thread; HeartbeatManagerTest#testHeartbeatTimeout actually handles the timeout 
in another pool thread.
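
To make case 2 concrete, here is a minimal sketch of the failure mode using only plain JDK classes. It is not Flink code: `MainThreadCheckSketch`, `isMainThread`, `directCheckPasses` and `scheduledCheckPasses` are hypothetical names, and the single-thread scheduler only stands in for any scheduler that runs its tasks in a thread other than the test thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class MainThreadCheckSketch {

    /** Hypothetical stand-in for a main thread check; not the Flink implementation. */
    static boolean isMainThread(Thread expected) {
        return Thread.currentThread() == expected;
    }

    /** Runs the check directly in the calling thread, as a direct executor would. */
    static boolean directCheckPasses() {
        final Thread mainThread = Thread.currentThread();
        final AtomicBoolean passed = new AtomicBoolean(false);
        Runnable action = () -> passed.set(isMainThread(mainThread));
        action.run();
        return passed.get();
    }

    /** Runs the same check through a scheduler that executes tasks in its own thread. */
    static boolean scheduledCheckPasses() {
        final Thread mainThread = Thread.currentThread();
        final AtomicBoolean passed = new AtomicBoolean(true);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            // The scheduled task runs in the scheduler's thread, not in the caller's thread.
            scheduler.schedule(() -> passed.set(isMainThread(mainThread)), 10, TimeUnit.MILLISECONDS)
                    .get();
        } catch (Exception e) {
            throw new IllegalStateException("scheduled task did not complete", e);
        } finally {
            scheduler.shutdownNow();
        }
        return passed.get();
    }

    public static void main(String[] args) {
        System.out.println("direct call passes check: " + directCheckPasses());
        System.out.println("scheduled call passes check: " + scheduledCheckPasses());
    }
}
```

The direct call passes the check because it runs in the thread that was recorded as the main thread, while the scheduled call fails it because the scheduler thread is a different thread.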

 

I'd be OK with closing this issue since no test breaks yet, as long as we are 
aware of the problem. 

The manual executor approach we explored in FLINK-12876 could be a solution 
for this case.
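
As a rough illustration of what a manual executor could look like, here is a sketch in plain JDK Java. It is only an assumption about the general idea, not the code from FLINK-12876: `ManualExecutorSketch`, `ManualExecutor`, `triggerAll` and `ranInCallerThread` are hypothetical names. Tasks submitted from any thread are only queued, and the test thread runs them explicitly, so they always execute in the thread that acts as the main thread.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

public class ManualExecutorSketch {

    /** Hypothetical executor that queues tasks until the test triggers them. */
    static final class ManualExecutor implements Executor {
        private final Queue<Runnable> queued = new ArrayDeque<>();

        @Override
        public synchronized void execute(Runnable command) {
            // Only enqueue; nothing runs in the submitting thread.
            queued.add(command);
        }

        private synchronized Runnable poll() {
            return queued.poll();
        }

        /** Runs all queued tasks in the calling thread and returns how many ran. */
        int triggerAll() {
            int count = 0;
            Runnable next;
            while ((next = poll()) != null) {
                next.run();
                count++;
            }
            return count;
        }
    }

    /** Submits a task from another thread, then triggers it from the calling thread. */
    static boolean ranInCallerThread() {
        final Thread caller = Thread.currentThread();
        final ManualExecutor executor = new ManualExecutor();
        final boolean[] ranInCaller = {false};

        Thread other = new Thread(
                () -> executor.execute(() -> ranInCaller[0] = Thread.currentThread() == caller));
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // The task was submitted from "other" but runs here, in the calling thread.
        executor.triggerAll();
        return ranInCaller[0];
    }

    public static void main(String[] args) {
        System.out.println("task ran in caller thread: " + ranInCallerThread());
    }
}
```

With this kind of executor a test decides when queued actions run, at the cost of having to trigger them explicitly.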

 



> Main thread checking in some tests fails
> ----------------------------------------
>
>                 Key: FLINK-12926
>                 URL: https://issues.apache.org/jira/browse/FLINK-12926
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Major
>         Attachments: Execution#deploy.jpg, mainThreadCheckFailure.log
>
>
> Currently all JM-side job-changing actions are expected to be taken in the 
> JobMaster main thread.
> In current Flink tests, many cases tend to use the test main thread as the JM 
> main thread. This can lead to 2 issues:
> 1. TestingComponentMainThreadExecutorServiceAdapter is a direct executor, so 
> if it is invoked from any other thread, it will break the main thread 
> checking and fail the submitted action (as in the attached log 
> [^mainThreadCheckFailure.log])
> 2. The test main thread does not support other actions queued in its 
> executor, as the test will end once the current test thread action (the 
> currently running test body) is done
>  
> In my observation, most cases which call 
> ExecutionGraph.scheduleForExecution() will encounter this issue. Cases 
> include ExecutionGraphRestartTest, FailoverRegionTest, 
> ConcurrentFailoverStrategyExecutionGraphTest, GlobalModVersionTest, 
> ExecutionGraphDeploymentTest, etc.
>  
> One solution I have in mind is to create a ScheduledExecutorService for those 
> tests, use it as the main thread and run the test body in this thread.
>  
>  



