[jira] [Updated] (FLINK-37813) JobManager failover during allocation slots causes ResourceManager to release unwanted TaskManager failure

Baozhu Zhao (Jira) Mon, 19 May 2025 06:34:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Baozhu Zhao updated FLINK-37813:
--------------------------------
    Description: 
Environment:
 * version: 1.17
 * resource provider:
 * job desc: job parallelism=27, slotPerWorker=10, so 3 workers are needed
 * job config: cluster.fine-grained-resource-management.enabled=true
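
The worker count above follows from a ceiling division of parallelism by slots per worker. A minimal sketch of that arithmetic, assuming one slot per parallel subtask; the class and method names are hypothetical and not part of Flink:

```java
// Hypothetical helper for illustration only; this is not Flink code.
final class WorkerCountSketch {

    // Smallest number of workers whose combined slots cover the parallelism.
    static int requiredWorkers(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException("parallelism must be >= 0 and slotsPerWorker > 0");
        }
        // Integer ceiling division: ceil(parallelism / slotsPerWorker).
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    public static void main(String[] args) {
        // Values from the report: parallelism=27, slotPerWorker=10.
        System.out.println(requiredWorkers(27, 10)); // prints 3
    }
}
```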

 

Issue description:
 * When the JobManager fails over, the SlotReport of the registered TaskManager 
does not meet expectations, so the ResourceManager is unable to release the 
idle TaskManager.

 

Steps to reproduce:

1. Kill a TaskManager so that the job fails; the slot manager will then 
reallocate the slots to the existing TaskManagers. Before the slot allocation 
completes, kill the JobManager, which puts the job in a SUSPEND state.  
[^old-tm.log]

2. After the new JobManager is launched, the existing TaskManagers register 
with it, and the slotNum in the slotReport sent by an existing TaskManager is 
larger than slotPerWorker. As a result, the `ActiveResourceManager` fails to 
correctly calculate the `releaseOrRequestWorkerNumber` when it periodically 
checks for and releases idle TaskManagers.   [^new-tm.log]
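
To make the mismatch in step 2 concrete, here is a rough sketch, not Flink's actual `ActiveResourceManager` logic. It assumes, purely for illustration, that a worker count is inferred from the total number of reported slots. The class, the method names, and the slot counts (including the 13-slot report) are made up for the example and are not taken from the attached logs:

```java
// Illustration only; names and numbers are hypothetical, not Flink internals.
final class SlotReportSketch {

    // Number of TaskManagers reporting more slots than the configured slotPerWorker.
    static int countOverReporting(int[] reportedSlotsPerTm, int slotsPerWorker) {
        int count = 0;
        for (int slots : reportedSlotsPerTm) {
            if (slots > slotsPerWorker) {
                count++;
            }
        }
        return count;
    }

    // Worker count implied by the total reported slots (ceiling division).
    static int workersImpliedBySlots(int[] reportedSlotsPerTm, int slotsPerWorker) {
        int total = 0;
        for (int slots : reportedSlotsPerTm) {
            total += slots;
        }
        return (total + slotsPerWorker - 1) / slotsPerWorker;
    }

    public static void main(String[] args) {
        int slotsPerWorker = 10;
        int[] expected = {10, 10, 10};      // three TaskManagers, as configured
        int[] afterFailover = {10, 13, 10}; // assumed: one TaskManager over-reports

        System.out.println(workersImpliedBySlots(expected, slotsPerWorker));      // prints 3
        System.out.println(workersImpliedBySlots(afterFailover, slotsPerWorker)); // prints 4
        System.out.println(countOverReporting(afterFailover, slotsPerWorker));    // prints 1
    }
}
```

Under these assumptions the slot-based estimate (4) no longer matches the number of registered TaskManagers (3), which is the kind of disagreement that could skew a release-or-request decision.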

 

!注册的tm.png|width=372,height=190!

 

 

  was:
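Environment:

Flink running on k8s

The job needs 3 TaskManagers, with 10 slots per TaskManager. 
The option `cluster.fine-grained-resource-management.enabled=true` is enabled.

Issue description:

After JobManager failover, the slot report of the registered TaskManager does not meet expectations, so the idle TaskManager cannot be released.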

 

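Steps to reproduce:

1. Kill a TaskManager, causing the job to fail over; the slot manager reallocates slots 
to the existing TaskManagers. Before slot allocation completes, kill the JobManager; the job enters the suspending state. [^old-tm.log]

2. After the new JM starts, the existing TaskManagers register. The slotReport registered by an existing TaskManager then has a larger slot num 
than that of a normal TaskManager. As a result, when the ResourceManager periodically checks for and releases idle TaskManagers, 
it cannot correctly calculate `releaseOrRequestWorkerNumber`, and the idle TaskManager is released. [^new-tm.log]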

 

!注册的tm.png|width=372,height=190!

 

 

    Environment:     (was: Environment:

Flink running on k8s

Flink version 1.17

The job needs 3 TaskManagers, with 10 slots per TaskManager. 
The option `cluster.fine-grained-resource-management.enabled=true` is enabled.)

> JobManager failover during allocation slots causes ResourceManager to release 
> unwanted TaskManager failure
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-tm.log, old-tm.log, 注册的tm.png
>
>
> Environment:
>  * version: 1.17
>  * resource provider:
>  * job desc: job parallelism=27, slotPerWorker=10, so 3 workers are needed
>  * job config: cluster.fine-grained-resource-management.enabled=true
>  
> Issue description:
>  * When the JobManager fails over, the SlotReport of the registered TaskManager 
> does not meet expectations, so the ResourceManager is unable to release the 
> idle TaskManager.
>  
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager will then 
> reallocate the slots to the existing TaskManagers. Before the slot allocation 
> completes, kill the JobManager, which puts the job in a SUSPEND state.  
> [^old-tm.log]
> 2. After the new JobManager is launched, the existing TaskManagers register 
> with it, and the slotNum in the slotReport sent by an existing TaskManager is 
> larger than slotPerWorker. As a result, the `ActiveResourceManager` fails to 
> correctly calculate the `releaseOrRequestWorkerNumber` when it periodically 
> checks for and releases idle TaskManagers.   [^new-tm.log]
>  
> !注册的tm.png|width=372,height=190!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
