[
https://issues.apache.org/jira/browse/FLINK-27350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated FLINK-27350:
---------------------------------
Affects Version/s: 1.13.2
> JobManager doesn't bring up new TaskManager during failure recovery
> -------------------------------------------------------------------
>
> Key: FLINK-27350
> URL: https://issues.apache.org/jira/browse/FLINK-27350
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.13.2
> Reporter: Chenyu Zheng
> Priority: Major
> Attachments: jobmanager.log,
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10.log
>
>
> I hit a strange bug during Flink failure recovery: the JobManager does not
> seem to bring up new TaskManagers after the replacement ones exit. Some logs
> and information about the Flink job are pasted below. Can you take a look and
> give me some guidance? Thank you so much!
>
> Flink version: 1.13.2
> Deploy mode: K8s native
> Timeline of the bug:
> # The Flink job starts running with 8 TaskManagers.
> # At *2022-04-17 00:28:15,286*, the job hit an error and the JobManager
> decided to restart 2 tasks (pods
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1,
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
> # The two old pods were stopped and the JobManager created 2 new pods (pods
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9,
> stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17
> 00:33:15,376*.
> # The JobManager discarded the two new pods’ registrations at *2022-04-17
> 00:33:32,393*.
> # The new pods exited at *2022-04-17 00:33:32,396* because their
> registration was rejected.
> # The JobManager did not bring up any further pods and printed the error
> “Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout” over and over.
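
For anyone reading the timeline, the gaps between the timestamped steps can be worked out with a short script. This is only a sketch: the timestamps are copied from the report above, the event labels are my own paraphrases, and it assumes all four timestamps come from clocks in the same timezone.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline in the report; the event labels are
# paraphrases added here for readability, not text from the logs.
EVENTS = [
    ("restart decided", "2022-04-17 00:28:15,286"),
    ("new pods created", "2022-04-17 00:33:15,376"),
    ("registration discarded", "2022-04-17 00:33:32,393"),
    ("new pods exited", "2022-04-17 00:33:32,396"),
]


def parse(ts: str) -> datetime:
    # The log format puts a comma before the milliseconds.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def gaps(events) -> list[timedelta]:
    # Difference between each event and the one before it.
    times = [parse(ts) for _, ts in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for (name, _), delta in zip(EVENTS[1:], gaps(EVENTS)):
        print(f"{name}: +{delta.total_seconds():.3f}s")
```

Run as a script, this shows roughly five minutes between the restart decision and the creation of the new pods, about 17 seconds until their registration is discarded, and 3 ms until they exit.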