[
https://issues.apache.org/jira/browse/FLINK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski updated FLINK-25832:
-----------------------------------
Component/s: Runtime / Coordination
(was: Runtime / Task)
> When the TaskManager is closed, its associated slot is not set to the
> released state.
> -------------------------------------------------------------------------------------
>
> Key: FLINK-25832
> URL: https://issues.apache.org/jira/browse/FLINK-25832
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.2, 1.14.3
> Reporter: john
> Priority: Major
> Attachments: image-2022-01-27-10-55-14-758.png,
> image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png
>
>
> I deployed a standalone Flink cluster on Kubernetes with
> scheduler-mode=reactive enabled. When a TaskManager is shut down, I
> explicitly call the closeTaskManagerConnection method of ResourceManager.
> When AdaptiveScheduler then restarts the job, it calls the cancel method of
> Execution, but this method does not check whether the associated slot is
> still alive. Because the TaskManager that owns the slot has already been
> shut down, the cancel call ends in an RpcTimeout.
> I then changed the cancel method of Execution to check whether the slot is
> alive before cancelling, but repeating the steps above still triggers the
> RpcTimeout. The underlying problem is that calling
> closeTaskManagerConnection on the ResourceManager does not change the state
> of the allocated slots associated with that TaskManager. I think this is a
> bug. We should optimize the behavior of cancel so that it completes faster
> in this situation.
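
For illustration, the behaviour requested in the description can be sketched
as a toy model. This is only a sketch under stated assumptions: Slot,
SlotRegistry, CancelOutcome and the String-based closeTaskManagerConnection
signature are hypothetical stand-ins and not Flink's actual classes or
signatures. The sketch shows why a liveness check in cancel only helps once
closing the TaskManager connection also marks the associated slots as
released.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every type here is hypothetical and is NOT a Flink class.
final class SlotAwareCancelSketch {

    enum SlotState { ALIVE, RELEASED }

    enum CancelOutcome { RPC_SENT, SKIPPED_SLOT_RELEASED }

    /** Hypothetical stand-in for a slot allocated on one TaskManager. */
    static final class Slot {
        final String taskManagerId;
        private SlotState state = SlotState.ALIVE;

        Slot(String taskManagerId) {
            this.taskManagerId = taskManagerId;
        }

        boolean isAlive() {
            return state == SlotState.ALIVE;
        }

        void release() {
            state = SlotState.RELEASED;
        }
    }

    /** Hypothetical registry that tracks which slots belong to which TaskManager. */
    static final class SlotRegistry {
        private final List<Slot> slots = new ArrayList<>();

        Slot allocate(String taskManagerId) {
            Slot slot = new Slot(taskManagerId);
            slots.add(slot);
            return slot;
        }

        // Assumed behaviour (what the report asks for): closing the connection
        // also moves every slot of that TaskManager to the released state.
        void closeTaskManagerConnection(String taskManagerId) {
            for (Slot slot : slots) {
                if (slot.taskManagerId.equals(taskManagerId)) {
                    slot.release();
                }
            }
        }
    }

    // Cancel skips the remote call when the slot is no longer alive, so it
    // never has to wait for an RPC timeout from a TaskManager that is gone.
    static CancelOutcome cancel(Slot slot) {
        if (!slot.isAlive()) {
            return CancelOutcome.SKIPPED_SLOT_RELEASED;
        }
        // A real implementation would issue the cancel RPC here.
        return CancelOutcome.RPC_SENT;
    }

    public static void main(String[] args) {
        SlotRegistry registry = new SlotRegistry();
        Slot onClosedTm = registry.allocate("tm-1");
        Slot onLiveTm = registry.allocate("tm-2");

        registry.closeTaskManagerConnection("tm-1");

        System.out.println("tm-1 slot: " + cancel(onClosedTm));
        System.out.println("tm-2 slot: " + cancel(onLiveTm));
    }
}
```

In this model, the liveness check in cancel has no effect unless
closeTaskManagerConnection also updates the slot state, which matches the
observation in the description that adding the check alone did not prevent
the timeout.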
--
This message was sent by Atlassian Jira
(v8.20.1#820001)