[ https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Baozhu Zhao updated FLINK-37813:
--------------------------------
Description:
environment:
 * version: 1.17
 * resource provider:
 * job desc: parallelism=27, slotPerWorker=10, so the job needs 3 workers
 * job config: cluster.fine-grained-resource-management.enabled=true

issue desc:
 * After a JobManager failover, the SlotReport sent by a re-registering TaskManager does not match expectations, so the ResourceManager is unable to release the idle TaskManager.

Reproduce steps:
1. Kill a TaskManager so that the job fails over. The slot manager then reallocates slots to the existing TaskManagers. Before the slot allocation completes, kill the JobManager, which puts the job into a SUSPEND state. [^old-tm.log]
2. After the new JobManager is launched, the existing TaskManagers register again, and the slotNum in the slotReport reported by an existing TaskManager is larger than slotPerWorker. As a result, the `ActiveResourceManager` fails to correctly calculate `releaseOrRequestWorkerNumber` during its scheduled check that releases idle TaskManagers. [^new-tm.log]

!注册的tm.png|width=372,height=190!

was:
Environment:
Flink running on k8s.
The job needs 3 TaskManagers with 10 slots each. The option `cluster.fine-grained-resource-management.enabled=true` is enabled.

Issue description:
After a JobManager failover, the slot report of the registered TaskManager does not match expectations, so the idle TaskManager cannot be released.

Reproduce steps:
1. Kill a TaskManager so that the job fails over. The slot manager reallocates slots to the existing TaskManagers. Before the slot allocation completes, kill the JobManager; the job enters the suspending state. [^old-tm.log]
2. After the new JM starts, the existing TaskManagers register, and the slotReport registered by an existing TaskManager contains more slots than that of a normal TaskManager. As a result, the ResourceManager cannot correctly calculate `releaseOrRequestWorkerNumber` during its scheduled check that releases idle TaskManagers, and the idle TaskManager is released. [^new-tm.log]

!注册的tm.png|width=372,height=190!
Environment: (was: Environment: Flink running on k8s, Flink version 1.17. The job needs 3 TaskManagers with 10 slots each. The option `cluster.fine-grained-resource-management.enabled=true` is enabled.)

> JobManager failover during allocation slots causes ResourceManager to release unwanted TaskManager failure
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>   Affects Versions: 1.17.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-tm.log, old-tm.log, 注册的tm.png
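
The numbers in the report can be checked with a small standalone sketch. It is not Flink code and does not reproduce how `ActiveResourceManager` computes `releaseOrRequestWorkerNumber`. The class `SlotReportCheck`, its methods, the TaskManager IDs and the inflated slot count of 13 are all hypothetical and only illustrate the symptom: with parallelism=27 and slotPerWorker=10 the job needs 3 workers, and a re-registered TaskManager reporting more than 10 slots falls outside what a per-worker calculation would expect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: sanity-checks the numbers from the report.
final class SlotReportCheck {

    /** Workers needed for the given parallelism: ceil(parallelism / slotsPerWorker). */
    static int requiredWorkers(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException("parallelism must be >= 0 and slotsPerWorker > 0");
        }
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** IDs of TaskManagers whose reported slot count exceeds slotsPerWorker. */
    static List<String> oversizedReports(Map<String, Integer> reportedSlots, int slotsPerWorker) {
        List<String> oversized = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : reportedSlots.entrySet()) {
            if (entry.getValue() > slotsPerWorker) {
                oversized.add(entry.getKey());
            }
        }
        return oversized;
    }

    public static void main(String[] args) {
        // Assumed values for illustration; only parallelism=27 and slotPerWorker=10 come from the issue.
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("tm-1", 10);
        reported.put("tm-2", 13); // hypothetical inflated slot count after JobManager failover
        reported.put("tm-3", 10);

        System.out.println(requiredWorkers(27, 10));        // prints 3
        System.out.println(oversizedReports(reported, 10)); // prints [tm-2]
    }
}
```

A check like this could be run against the slot counts visible in new-tm.log to confirm which TaskManagers registered with more slots than slotPerWorker.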