[ 
https://issues.apache.org/jira/browse/FLINK-27350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyu Zheng updated FLINK-27350:
---------------------------------
    Affects Version/s: 1.13.2

> JobManager doesn't bring up new TaskManager during failure recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-27350
>                 URL: https://issues.apache.org/jira/browse/FLINK-27350
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.2
>            Reporter: Chenyu Zheng
>            Priority: Major
>         Attachments: jobmanager.log, 
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10.log
>
>
> I hit a strange bug during Flink failure recovery: the JobManager does not 
> seem to bring up new TaskManagers. Some logs and information about the Flink 
> job are pasted below. Can you take a look and give me some guidance? Thank 
> you so much!
>  
> Flink version: 1.13.2
> Deployment mode: native Kubernetes
> Timeline of the bug:
>  # The Flink job started running with 8 TaskManagers.
>  # At {*}2022-04-17 00:28:15,286{*}, the job hit an error and the JobManager 
> decided to restart 2 tasks (pods 
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1 and 
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
>  # The two old pods were stopped and the JobManager created 2 new pods 
> (stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and 
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17 
> 00:33:15,376*.
>  # The JobManager discarded the two new pods’ registrations at *2022-04-17 
> 00:33:32,393*.
>  # The new pods exited at {*}2022-04-17 00:33:32,396{*} because their 
> registration was rejected.
>  # The JobManager did not bring up any new pods and printed the error “Slot 
> request bulk is not fulfillable! Could not allocate the required slot within 
> slot request timeout” over and over.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
