[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after recovery from failure in HA Mode

Gyula Fora (Jira) Tue, 09 Jan 2024 11:47:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804840#comment-17804840
 ]


Gyula Fora edited comment on FLINK-34007 at 1/9/24 7:46 PM:
------------------------------------------------------------

From initial investigation, the job manager first loses leadership and then 
goes to SUSPENDED status. Shouldn't the job manager exit directly rather than 
go to SUSPENDED status?

 


was (Author: zhenqiuhuang):
From initial investigation, the job manager first loses leadership and then 
goes to SUSPENDED status. Shouldn't the job manager exit directly rather than 
go to SUSPENDED status?

2024-01-08 21:44:57,142 INFO  org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership with leader id 9987190b-35f4-4238-b317-057dc3615e4d. Stopping current JobMasterServiceProcess.
2024-01-08 21:45:16,280 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://172.16.197.136:8081 lost leadership
2024-01-08 21:45:16,280 INFO  org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id 9987190b-35f4-4238-b317-057dc3615e4d.
2024-01-08 21:45:16,281 INFO  org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id 9987190b-35f4-4238-b317-057dc3615e4d. Stopping the DispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping dispatcher pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping all currently running jobs of dispatcher pekko.tcp://[email protected]:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job 'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
2024-01-08 21:45:16,285 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
2024-01-08 21:45:16,286 INFO  org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopping credential renewal
2024-01-08 21:45:16,286 INFO  org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopped credential renewal
2024-01-08 21:45:16,286 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Closing the slot manager.
2024-01-08 21:45:16,286 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Suspending the slot manager.
2024-01-08 21:45:16,287 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2024-01-08 21:45:16,287 INFO  org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map'}.
2024-01-08 21:45:16,287 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for amp-ae-video-uat/acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map, watching id:cc34317a-3299-4cb5-a966-55cb546e8bf9
2024-01-08 21:45:16,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) switched from state RUNNING to SUSPENDED.
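
To make the ordering in the log above easier to follow, here is a minimal 
sketch of a helper that pulls only the leadership and SUSPENDED events out of 
a JM log. It is not part of Flink. The function names and the keyword list are 
assumptions chosen for illustration, and the timestamp pattern assumes the 
log layout shown in the excerpt. It accepts records on one line as well as 
records wrapped across several lines.

```python
import re

# A record starts with a timestamp such as "2024-01-08 21:45:16,280"
# (assumption: the layout used in the excerpt above).
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

# Phrases taken from the excerpt; hypothetical list, extend as needed.
KEYWORDS = ("revoked", "lost leadership", "SUSPENDED")


def join_records(lines):
    """Merge wrapped continuation lines back into one string per record."""
    records = []
    for line in lines:
        text = " ".join(line.split())  # collapse padding and line endings
        if not text:
            continue
        if RECORD_START.match(text):
            records.append(text)
        elif records:
            # Continuation of the previous record.
            records[-1] += " " + text
    return records


def leadership_timeline(lines):
    """Return (timestamp, message) pairs for records containing a keyword."""
    timeline = []
    for record in join_records(lines):
        if any(keyword in record for keyword in KEYWORDS):
            timestamp = record[:23]  # "YYYY-MM-DD HH:MM:SS,mmm"
            message = record.split(" - ", 1)[-1]
            timeline.append((timestamp, message))
    return timeline
```

Calling leadership_timeline(open("jobmanager.log")) (the file name is a 
placeholder) would list the revocation events next to the SUSPENDED 
transitions, which is the sequence the question above is about.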

> Flink Job stuck in suspend state after recovery from failure in HA Mode
> -----------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>
> The observation is that the job manager goes to the SUSPENDED state, with a 
> failed container unable to register itself with the resource manager after 
> the timeout.
> JM log: see attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
