[jira] [Commented] (FLINK-37773) Extra TMs are started when Jobmanger is OOM killed in some FlinkDeployment runs

Nishant More (Jira) Tue, 03 Jun 2025 12:58:10 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955969#comment-17955969
 ]


Nishant More commented on FLINK-37773:
--------------------------------------

We see the same issue in Flink 1.18.1 whenever the JM restarts. It went away 
once we disabled the fine-grained slot manager. However, it still needs to be 
investigated why the resources are not released when fine-grained slot 
management is enabled, which is the default.

Here is the document that I created. I can't attach it, so I am pasting it here:



{*}Analysis of TaskManager Over-Provisioning in Flink 1.18.1 (Per-Job Mode on 
YARN){*}

{*}Issue Overview{*}

When running Flink 1.18.1 in per-job mode on a YARN cluster, we observed that 
restarting the Application Master (the active JobManager, aJM) of a job led to 
an unexpected increase in the number of TaskManagers during recovery. After the 
job had recovered, these additional TaskManagers were sometimes not released, 
resulting in resource over-utilization and potentially impacting overall 
cluster efficiency and job scheduling.

{*}Observed Behavior{*}
 * On aJM restart (Flink 1.18.1):
 ** TaskManagers are re-requested to aid recovery.
 ** After recovery: some of these TaskManagers are not released automatically.
 ** Impact: persistent over-provisioning, leading to higher resource usage and 
YARN capacity pressure.
 * Previous version (Flink 1.15.1):
 ** No such lingering TaskManagers were observed upon recovery.
 ** Resource usage stabilized after job recovery as expected.

{*}Steps To Reproduce The Issue{*}
 # Launch a Flink 1.18.1 job (parallelism is 120 and 120 containers are 
instantiated).
 # Once the job has started and all operators are in the RUNNING state, kill 
the active JM.
 # A new JM is activated.
 # From the logs, we can see that 120 containers are recovered by default as 
part of fine-grained slot management:

|2025-06-02 05:26:30,497 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Recovered 120 workers from previous attempt.|

However, the previous 74 containers are also recovered later:

 
|2025-06-02 05:26:32,762 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need 
request 71 new workers, current worker number 49, declared worker number 120|
We also tried the approaches from the following tickets, but they did not work 
on YARN:
 * https://issues.apache.org/jira/browse/FLINK-24713
 * https://issues.apache.org/jira/browse/FLINK-27576
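
To make the two ActiveResourceManager messages above easier to compare across 
JM restarts, here is a minimal Python sketch that extracts the worker counts 
from a JM log. The function names and the "potential surplus" calculation are 
assumptions for illustration only. The calculation assumes that every recovered 
worker eventually re-registers and every newly requested worker is also 
started, which has not been confirmed as the actual mechanism.

```python
import re

# Patterns for the two ActiveResourceManager messages quoted above.
RECOVERED = re.compile(r"Recovered (\d+) workers from previous attempt")
REQUEST = re.compile(
    r"need request (\d+) new workers, current worker number (\d+), "
    r"declared worker number (\d+)"
)

# The two log lines from this comment, wrapped the same way as in the email.
SAMPLE_LOG = """
2025-06-02 05:26:30,497 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Recovered 120 workers from previous attempt.
2025-06-02 05:26:32,762 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need
request 71 new workers, current worker number 49, declared worker number 120
"""


def summarize(log_text):
    """Collect recovered-worker counts and new-worker requests from a JM log."""
    # Collapse whitespace so the patterns also match wrapped log lines.
    flat = " ".join(log_text.split())
    recovered = [int(m.group(1)) for m in RECOVERED.finditer(flat)]
    requests = [tuple(int(g) for g in m.groups()) for m in REQUEST.finditer(flat)]
    return {"recovered": recovered, "requests": requests}


def potential_surplus(summary):
    """Hypothetical upper bound on extra workers (see the assumption above)."""
    if not summary["recovered"] or not summary["requests"]:
        return 0
    new, _current, declared = summary["requests"][-1]
    return max(0, summary["recovered"][-1] + new - declared)


if __name__ == "__main__":
    result = summarize(SAMPLE_LOG)
    print(result, potential_surplus(result))
```

Running this against the JM log of each attempt gives one set of numbers per 
restart, which can then be compared with the container count reported by YARN.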

 

{*}Root Cause Investigation{*}
 * Slot manager changes:
 ** Flink 1.15.1: uses the declarative slot manager (default), which manages 
TaskManager slots holistically and scales the number of TaskManagers up or 
down as needed.
 ** Flink 1.18.1: the default slot manager is fine-grained, allowing 
TaskManagers to have different resource configurations per slot (more 
flexibility, but it can also cause resource fragmentation).
 * Testing the configuration:
 ** Disabling fine-grained resource management in Flink 1.18.1:

|cluster.fine-grained-resource-management.enabled=false|
 ** Result: the issue was mitigated. TaskManagers were properly released 
post-recovery, behaving like Flink 1.15.1.
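
Before re-running the test, it can help to confirm that the setting is really 
present in the configuration the job is launched with. The following is a 
rough Python sketch, assuming the classic flat "key: value" layout of 
flink-conf.yaml. The function names are hypothetical, the parser is 
deliberately simplistic (no nested YAML, no '#' inside values), and it does not 
see values passed in any other way, for example on the command line.

```python
KEY = "cluster.fine-grained-resource-management.enabled"


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, ignoring blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def fine_grained_explicitly_disabled(text):
    """Return True only if the key is present and set to false."""
    return parse_flat_conf(text).get(KEY, "").lower() == "false"


if __name__ == "__main__":
    sample = (
        "# hypothetical flink-conf.yaml fragment\n"
        "taskmanager.numberOfTaskSlots: 1\n"
        "cluster.fine-grained-resource-management.enabled: false\n"
    )
    print(fine_grained_explicitly_disabled(sample))
```

Note that a missing key is reported as "not disabled", because in that case the 
version's default applies.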

{*}Comparative Evidence{*}

A) Before Fix

_(Flink 1.18.1, fine-grained slot management enabled)_
 * Statsboard graphs: the TM container count keeps increasing every time the JM 
is restarted.

B) After Fix

_(Flink 1.18.1, fine-grained slot management disabled)_
 * Statsboard graphs: the container count stabilized, even after a restart.

{*}Resolution & Recommendation{*}
 * Immediate mitigation: disabling fine-grained resource management in Flink 
1.18.1 using

|cluster.fine-grained-resource-management.enabled=false|

restores the previous resource release behavior and fixes the over-provisioning 
problem.
 * Long-term:
 ** Monitor updates and bug reports from the Flink community regarding 
fine-grained slot management in per-job mode.
 ** Revisit enabling fine-grained resource management once it is confirmed 
fixed or fully supported for our use case.

{*}Summary Table{*}
|{*}Flink Version{*}|{*}Slot Manager Type{*}|{*}Observed Behavior{*}|{*}Resource Release{*}|{*}Mitigation{*}|
|1.15.1|Declarative|Normal; no over-provisioning|Yes|N/A|
|1.18.1|Fine-Grained (default)|Extra TaskManagers linger after recovery|No|Disable fine-grained|
|1.18.1|Declarative (manual)|Normal; no over-provisioning|Yes|Set config false|

 

 

{*}References{*}
 * Issue tracker:
 ** Flink ticket already raised for a similar issue on Kubernetes: 
https://issues.apache.org/jira/browse/FLINK-37773
 * Flink Release Notes:
 ** [https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/finegrained_resource/]
 ** https://issues.apache.org/jira/browse/FLINK-31448 Use FineGrainedSlotManager 
as the default SlotManager

 

> Extra TMs are started when Jobmanger is OOM killed in some FlinkDeployment 
> runs
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37773
>                 URL: https://issues.apache.org/jira/browse/FLINK-37773
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.10.0
>            Reporter: Nihar Rao
>            Priority: Major
>         Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png, Screenshot 
> 2025-06-02 at 11.53.28 AM.png
>
>
> Hi,
> We are running into a weird issue with Apache Flink Kubernetes Operator 
> 1.10.0 and Apache Flink 1.19.1. We are running jobs using native Kubernetes 
> application mode and the FlinkDeployment CRD. We are running a job with 24 
> TaskManagers and 1 JobManager replica with HA enabled.
> Below is the chronological summary of events:
> 1. The job was initially started with 24 TaskManagers.
> 2. The JM pod was OOMKilled. This is confirmed by our KSM metrics, and 
> {{kubectl describe pod <JM pod>}} shows that the pod restarted due to OOM as 
> well.
> 3. After the JM was OOMKilled, it was restarted and 24 new TaskManager pods 
> were started, as confirmed in the available task slots section of the Flink UI.
> 4. There was no impact on the job (it restarted successfully), but there are 
> 48 TaskManagers running, of which 24 are standby. The expected behaviour 
> after a JM OOM with HA enabled is that no new TaskManagers are started.
> 5. I have added a screenshot of the Flink UI showing the 24 extra TMs (48 task 
> slots) and the kubectl output below.
> I also checked the Kubernetes operator pod logs and did not find anything 
> that could explain this behaviour. This has happened a few times now with 
> different jobs. We have tried purposely OOMKilling the JobManager of one of 
> our test jobs many times, but we haven't been able to reproduce this 
> behaviour. It looks to be an edge case that is difficult to reproduce.
> Can you please help us debug this, as the Kubernetes operator doesn't show 
> any relevant information on why this happened? Thanks, and let me know if you 
> need further information.
> kubectl get pod output showing the 24 extra TMs:
> NAME                                      READY   STATUS    RESTARTS      AGE
> ioi-quality-667f575877-btfkv              1/1     Running   1 (39h ago)   4d16h
> ioi-quality-taskmanager-1-1               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-10              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-11              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-12              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-13              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-14              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-15              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-16              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-17              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-18              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-19              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-2               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-20              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-21              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-22              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-23              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-24              1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-3               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-4               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-5               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-6               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-7               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-8               1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-9               1/1     Running   0             4d16h
> ioi-quality-taskmanager-2-1               1/1     Running   0             39h
> ioi-quality-taskmanager-2-10              1/1     Running   0             39h
> ioi-quality-taskmanager-2-11              1/1     Running   0             39h
> ioi-quality-taskmanager-2-12              1/1     Running   0             39h
> ioi-quality-taskmanager-2-13              1/1     Running   0             39h
> ioi-quality-taskmanager-2-14              1/1     Running   0             39h
> ioi-quality-taskmanager-2-15              1/1     Running   0             39h
> ioi-quality-taskmanager-2-16              1/1     Running   0             39h
> ioi-quality-taskmanager-2-17              1/1     Running   0             39h
> ioi-quality-taskmanager-2-18              1/1     Running   0             39h
> ioi-quality-taskmanager-2-19              1/1     Running   0             39h
> ioi-quality-taskmanager-2-2               1/1     Running   0             39h
> ioi-quality-taskmanager-2-20              1/1     Running   0             39h
> ioi-quality-taskmanager-2-21              1/1     Running   0             39h
> ioi-quality-taskmanager-2-22              1/1     Running   0             39h
> ioi-quality-taskmanager-2-23              1/1     Running   0             39h
> ioi-quality-taskmanager-2-24              1/1     Running   0             39h
> ioi-quality-taskmanager-2-3               1/1     Running   0             39h
> ioi-quality-taskmanager-2-4               1/1     Running   0             39h
> ioi-quality-taskmanager-2-5               1/1     Running   0             39h
> ioi-quality-taskmanager-2-6               1/1     Running   0             39h
> ioi-quality-taskmanager-2-7               1/1     Running   0             39h
> ioi-quality-taskmanager-2-8               1/1     Running   0             39h
> ioi-quality-taskmanager-2-9               1/1     Running   0             39h
>  
>  
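
For anyone trying to catch this state on Kubernetes, the pod listing in the 
quoted description can be summarized per TaskManager generation. The sketch 
below is in Python and assumes the naming pattern visible in that listing, 
<deployment>-taskmanager-<attempt>-<index>, where the first number is taken to 
be the JM attempt (the 1-x pods are 4d16h old, the 2-x pods appeared 39h ago 
with the JM restart). The function name is hypothetical.

```python
import re
from collections import Counter

# Assumed pod naming, based on the listing in the quoted issue description:
# <deployment>-taskmanager-<attempt>-<index>
TM_POD = re.compile(r"^\S+-taskmanager-(?P<attempt>\d+)-(?P<index>\d+)\s")


def count_tms_by_attempt(kubectl_output):
    """Count TaskManager pods per attempt number in 'kubectl get pod' output."""
    counts = Counter()
    for line in kubectl_output.splitlines():
        # Drop email quote markers and make sure the name ends in whitespace.
        line = line.lstrip("> ").rstrip() + " "
        match = TM_POD.match(line)
        if match:
            counts[int(match.group("attempt"))] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = """\
NAME                            READY   STATUS    RESTARTS      AGE
ioi-quality-667f575877-btfkv    1/1     Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1     1/1     Running   0             4d16h
ioi-quality-taskmanager-1-2     1/1     Running   0             4d16h
ioi-quality-taskmanager-2-1     1/1     Running   0             39h
"""
    print(count_tms_by_attempt(sample))
```

More than one attempt number with running pods, long after the job has 
recovered, is the situation described in this ticket.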



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
