[
https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weihua Hu resolved FLINK-37813.
-------------------------------
Resolution: Fixed
master: b0b6669c5f989a53e3aa8763a7b3f91e7c35b7b7
1.20: c595cdd7179f456dd2e7c02b3943d3a3d7c891b3
> SlotManager re-allocating slots upon failover causes ResourceManager to start
> more TaskManagers and fail to release unwanted TaskManagers
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-37813
> URL: https://issues.apache.org/jira/browse/FLINK-37813
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.17.2, 1.19.2
> Reporter: xingsuo-zbz
> Assignee: Weihua Hu
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2025-09-08-22-19-26-427.png,
> image-2025-09-08-22-22-20-365.png, new-jm.log, old-jm.log,
> re-allocate-slot.png, slot-report.png, 注册的tm.png
>
>
> Environment:
> * version: 1.17
> * resource provider:
> * job desc: parallelism=27, slotPerWorker=10, so the job needs 3 workers
> * job config: cluster.fine-grained-resource-management.enabled=true
>
> Issue description:
> * After a JobManager failover, the SlotReport of a re-registered TaskManager
> does not match expectations, so the ResourceManager is unable to release the
> idle TaskManager.
>
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then
> reallocates the slots to the remaining TaskManagers. Before the slot
> allocation completes, kill the JobManager, which puts the job in the
> SUSPENDED state. With some probability, `FineGrainedSlotManager` calls
> `declareNeededResources()` again and allocates slots after they have been
> released. [^old-jm.log]
> ^!re-allocate-slot.png|width=1152,height=501!^
> 2. After the new JobManager starts, the existing TaskManagers register
> again, and the slotNum in the SlotReport sent by an existing TaskManager is
> larger than slotPerWorker. As a result, `ActiveResourceManager` cannot
> correctly calculate `releaseOrRequestWorkerNumber` when it periodically
> checks for and releases idle TaskManagers. [^new-jm.log]
> !slot-report.png|width=991,height=554!
> !注册的tm.png|width=372,height=190!
>
>
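The worker count quoted in the environment section can be reproduced with a ceiling division, and the symptom in step 2 amounts to a registered TaskManager reporting more slots than slotPerWorker. The sketch below illustrates both checks. It is not Flink code: the class `SlotMath`, its method names, and the reported slot count of 12 are hypothetical, and the real `ActiveResourceManager` logic may differ.

```java
// Hypothetical helper for illustration only; none of these names exist in Flink.
final class SlotMath {
    private SlotMath() {}

    /** Ceiling division: number of workers needed to host `parallelism` slots. */
    static int neededWorkers(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException("parallelism must be >= 0 and slotsPerWorker > 0");
        }
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** True when a worker reports more slots than the configured slotPerWorker. */
    static boolean reportsTooManySlots(int reportedSlots, int slotsPerWorker) {
        return reportedSlots > slotsPerWorker;
    }

    public static void main(String[] args) {
        // Values from the issue: parallelism=27, slotPerWorker=10.
        System.out.println("needed workers: " + neededWorkers(27, 10));
        // 12 is an assumed example value for an inflated SlotReport.
        System.out.println("inflated report: " + reportsTooManySlots(12, 10));
    }
}
```

With the values from the issue, `neededWorkers(27, 10)` yields 3, matching the "need 3 worker" statement. Any SlotReport for which `reportsTooManySlots` is true corresponds to the unexpected state described in step 2.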
--
This message was sent by Atlassian Jira
(v8.20.10#820010)