[ 
https://issues.apache.org/jira/browse/FLINK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823017#comment-17823017
 ] 

junzhong qin commented on FLINK-34526:
--------------------------------------

[~xtsong] Thanks for the detailed information. I will prepare the PR.

> Actively disconnect the killed TM in RM to reduce restart time
> --------------------------------------------------------------
>
>                 Key: FLINK-34526
>                 URL: https://issues.apache.org/jira/browse/FLINK-34526
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: junzhong qin
>            Assignee: junzhong qin
>            Priority: Not a Priority
>         Attachments: image-2024-02-27-15-50-25-071.png, 
> image-2024-02-27-15-50-39-337.png
>
>
> In our test case, the pipeline is:
> !image-2024-02-27-15-50-25-071.png!
>  # parallelism = 100
>  # taskmanager.numberOfTaskSlots = 2
>  # disable checkpoint
> h3. Phenomenon
> When the worker was killed at 2024-02-27 15:10:13,691
> {code:java}
> 2024-02-27 15:10:13,691 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Worker container_e2472_1706081484717_60538_01_000050 is terminated. 
> Diagnostics: Container container_e2472_1706081484717_60538_01_000050 marked 
> as failed. Exit code:137. Diagnostics:[2024-02-27 15:10:12.720]Container 
> killed on request. Exit code is 137[2024-02-27 15:10:12.763]Container exited 
> with a non-zero exit code 137. [2024-02-27 15:10:12.839]Killed by external 
> signal {code}
> It took about 20 seconds to restart the job.
> {code:java}
> 2024-02-27 15:10:30,749 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> datagen_source[1] -> Sink: print_sink[2] (70/100) 
> (2a1d06e6610bc499475fa6e647f8cac9_d3f21cabc6fe0fdf76c8be915bdb22a2_69_0) 
> switched from RUNNING to FAILED on 
> container_e2472_1706081484717_60538_01_000050 @ xxx 
> (dataPort=38597).org.apache.flink.runtime.jobmaster.JobMasterException: 
> TaskManager with id container_e2472_1706081484717_60538_01_000050(xxx:5454) 
> is no longer reachable.
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1515)
>     at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>     at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>     at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>     at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>     at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)........//
>  Deploy and run task
> 2024-02-27 15:10:32,426 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> datagen_source[1] -> Sink: print_sink[2] (70/100) 
> (2a1d06e6610bc499475fa6e647f8cac9_d3f21cabc6fe0fdf76c8be915bdb22a2_69_1) 
> switched from DEPLOYING to INITIALIZING.2024-02-27 15:10:32,427 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> datagen_source[1] -> Sink: print_sink[2] (69/100) 
> (2a1d06e6610bc499475fa6e647f8cac9_d3f21cabc6fe0fdf76c8be915bdb22a2_68_1) 
> switched from DEPLOYING to INITIALIZING.2024-02-27 15:10:33,347 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> datagen_source[1] -> Sink: print_sink[2] (70/100) 
> (2a1d06e6610bc499475fa6e647f8cac9_d3f21cabc6fe0fdf76c8be915bdb22a2_69_1) 
> switched from INITIALIZING to RUNNING.2024-02-27 15:10:33,421 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
> datagen_source[1] -> Sink: print_sink[2] (69/100) 
> (2a1d06e6610bc499475fa6e647f8cac9_d3f21cabc6fe0fdf76c8be915bdb22a2_68_1) 
> switched from INITIALIZING to RUNNING. {code}
>  
> h3. Reason
> When the RM received the message the TM was killed, the JobMaster still kept 
> the connection with the killed TM.  And the JobMaster found the TM is no 
> longer reachable after about 17 seconds.
> h3. Solution:
> We can reduce the restart time by disConnectTaskManager actively in 
> ResourceManager
> {code:java}
> // class ResourceManager
> protected Optional<WorkerType> closeTaskManagerConnection(
>         final ResourceID resourceID, final Exception cause) {
>    // ....
>    // using JobManagerGateway to actively disconnect TaskManager 
>     
> workerRegistration.getTaskExecutorGateway().disconnectResourceManager(cause);
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to