john created FLINK-25832:
----------------------------
Summary: When the TaskManager is closed, its associated slot is
not set to the released state.
Key: FLINK-25832
URL: https://issues.apache.org/jira/browse/FLINK-25832
Project: Flink
Issue Type: Bug
Components: Runtime / Task
Affects Versions: 1.14.3, 1.14.2
Reporter: john
Attachments: image-2022-01-27-10-55-14-758.png,
image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png
I deployed a standalone Flink cluster on k8s and enabled
scheduler-mode=reactive. When a TaskManager is shut down, I explicitly call
the closeTaskManagerConnection method of ResourceManager. However, when
AdaptiveScheduler then restarts the job, it calls the cancel method of
Execution, and this method does not check whether the associated slot is
still alive. Because the TaskManager that owns this slot has already been
shut down, an RpcTimeout is triggered.
I then changed the cancel method of Execution to check whether the slot is
alive before cancelling, but repeating the steps above made no difference:
RpcTimeout is still triggered. My problem is that explicitly calling the
ResourceManager's closeTaskManagerConnection method does not affect the state
of the allocated slots associated with that TaskManager. I think this is a
bug. We should optimize the behavior of cancel so that it completes faster.
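The interaction described above can be modelled with a minimal,
self-contained sketch. It does not use Flink's real classes: Slot,
SlotRegistry, CancelOutcome and the releaseSlots flag are hypothetical
stand-ins, and only the method names closeTaskManagerConnection and cancel
are borrowed from the report. The sketch only shows why a liveness check in
cancel cannot help unless closing the TaskManager connection also updates
the slot state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only: none of these types are Flink classes. */
final class SlotReleaseSketch {

    /** Hypothetical stand-in for an allocated slot with a liveness flag. */
    static final class Slot {
        final String taskManagerId;
        private boolean alive = true;

        Slot(String taskManagerId) {
            this.taskManagerId = taskManagerId;
        }

        boolean isAlive() {
            return alive;
        }

        void release() {
            alive = false;
        }
    }

    /** Hypothetical bookkeeping of which slots belong to which TaskManager. */
    static final class SlotRegistry {
        private final Map<String, List<Slot>> slotsByTaskManager = new HashMap<>();

        Slot allocate(String taskManagerId) {
            Slot slot = new Slot(taskManagerId);
            slotsByTaskManager
                    .computeIfAbsent(taskManagerId, id -> new ArrayList<>())
                    .add(slot);
            return slot;
        }

        /**
         * Forgets the TaskManager. With releaseSlots == false the slots keep
         * their "alive" state (the reported behavior); with true they are
         * marked released (the behavior the report asks for).
         */
        void closeTaskManagerConnection(String taskManagerId, boolean releaseSlots) {
            List<Slot> slots = slotsByTaskManager.remove(taskManagerId);
            if (slots != null && releaseSlots) {
                slots.forEach(Slot::release);
            }
        }
    }

    enum CancelOutcome { RPC_SENT, SKIPPED_SLOT_RELEASED }

    /** Cancel with a liveness guard, as tried in the report. */
    static CancelOutcome cancel(Slot slot) {
        if (!slot.isAlive()) {
            // No RPC is needed: the owning TaskManager is known to be gone.
            return CancelOutcome.SKIPPED_SLOT_RELEASED;
        }
        // Stand-in for the cancel RPC that would time out if the
        // TaskManager no longer exists.
        return CancelOutcome.RPC_SENT;
    }

    public static void main(String[] args) {
        SlotRegistry registry = new SlotRegistry();

        Slot stale = registry.allocate("tm-1");
        registry.closeTaskManagerConnection("tm-1", false);
        System.out.println(cancel(stale)); // prints RPC_SENT

        Slot released = registry.allocate("tm-2");
        registry.closeTaskManagerConnection("tm-2", true);
        System.out.println(cancel(released)); // prints SKIPPED_SLOT_RELEASED
    }
}
```

In the first case the guard passes because nothing ever marked the slot as
released, so the RPC is still attempted. Only the second case lets cancel
return immediately.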
!image-2022-01-27-10-55-59-119.png!
!image-2022-01-27-10-57-26-223.png!
!image-2022-01-27-10-55-14-758.png!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)