[ 
https://issues.apache.org/jira/browse/FLINK-26400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499990#comment-17499990
 ] 

Zhu Zhu edited comment on FLINK-26400 at 3/2/22, 9:09 AM:
----------------------------------------------------------

I have done the test and the result looks good. The stopped TaskManager can be 
quickly identified by the JobManager to trigger a restart. I also tried 
directly `kill` a TaskManager (in SIGTERM way) and it also works well. 

Thanks for this improvement [~nsemmler]! I think it is very useful even for 
non-reactive mode. Because taking away a problematic machine is very common in 
production, and in our experience it is a common cause of Flink streaming job 
failures. This improvement can speed up the error detection, and hence speed up 
the job recovery.

I noticed 2 problems which are out of the scope of this test though.
1. When there is no resource and job is in CREATED state, I will get an error 
"Job failed during initialization of JobManager" when trying to open the job 
page, and cannot see the job topology or other job informations.
2. The parallelisms of graph nodes shown in the web UI is not updated when job 
vertex parallelisms change. I guess the root cause is the json plan is not 
updated on job vertex parallelism changes. Because we encountered the same 
issue when developing AdaptiveBatchScheduler. As a reference, our solution can 
be found 
[here|https://github.com/apache/flink/blob/152ad4fc14920372076c0004793c179141ae10c7/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java#L224].


was (Author: zhuzh):
I have done the test and the result looks good. The stopped TaskManager can be 
quickly identified by the JobManager to trigger a restart. I also tried 
directly `kill` a TaskManager (in SIGTERM way) and it also works well. 
Thanks for this improvement [~nsemmler]! I think it is very useful even for 
non-reactive mode. Because taking away a problematic machine is very common in 
production, and in our experience it is a common cause of Flink streaming job 
failures. This improvement can speed up the error detection, and hence speed up 
the job recovery.

I noticed 2 problems which are out of the scope of this test though.
1. When there is no resource and job is in CREATED state, I will get an error 
"Job failed during initialization of JobManager" when trying to open the job 
page, and cannot see the job topology or other job informations.
2. The parallelism shown in the graph is not updated when a job vertex 
parallelism changes. I guess the root cause is the json plan is not updated on 
job vertex parallelism changes. Because we encountered the same issue when 
developing AdaptiveBatchScheduler. As a reference, our solution can be found 
[here|https://github.com/apache/flink/blob/152ad4fc14920372076c0004793c179141ae10c7/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java#L224].

> Release Testing: Explicit shutdown signalling from TaskManager to JobManager
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-26400
>                 URL: https://issues.apache.org/jira/browse/FLINK-26400
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Niklas Semmler
>            Assignee: Zhu Zhu
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> FLINK-25277 introduces explicit signalling between a TaskManager and the 
> JobManager when the TaskManager shuts down. This reduces the time it takes 
> for a reactive cluster to down-scale & restart.
>  
> *Setup*
>  # Add the following line to your flink config to enable reactive mode:
> {code}
> taskmanager.host: localhost # a workaround
> scheduler-mode: reactive
> restart-strategy: fixeddelay
> restart-strategy.fixed-delay.attempts: 100
> {code}
>  # Create a “usrlib” folder and place the TopSpeedWindowing jar into it
> {code:bash}
> $ mkdir usrlib
> $ cp examples/streaming/TopSpeedWindowing.jar usrlib/
> {code}
>  # Start the job 
> {code:bash}
> $ bin/standalone-job.sh start  --main-class 
> org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
> {code}
>  # Start three task managers
> {code:bash}
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> $ bin/taskmanager.sh start
> {code}
>  # Wait for the job to stabilize. The log file should show that three tasks 
> start for every operator.
> {code}
>  GlobalWindows -> Sink: Print to Std. Out (3/3) 
> (d10339d5755d07f3d9864ed1b2147af2) switched from INITIALIZING to 
> RUNNING.{code}
> *Test*
> Stop one taskmanager
> {code:bash}
> $ bin/taskmanager.sh stop
> {code}
> Success condition: You should see that the job cancels and re-runs after a 
> few seconds. In the logs you should find a line with the text “The 
> TaskExecutor is shutting down”.
> *Teardown*
> Stop all taskmanagers and the jobmanager:
> {code:bash}
> $ bin/standalone-job.sh stop
> $ bin/taskmanager.sh stop-all
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to