Nihar Rao created FLINK-37773:
---------------------------------

             Summary: Extra TMs are started when JobManager is OOM killed in 
some FlinkDeployment runs
                 Key: FLINK-37773
                 URL: https://issues.apache.org/jira/browse/FLINK-37773
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.10.0
            Reporter: Nihar Rao
         Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png

Hi,

We are running into an odd issue with Apache Flink Kubernetes Operator 1.10.0 
and Apache Flink 1.19.1. We run jobs in native Kubernetes application mode via 
the FlinkDeployment CRD. The affected job runs with 24 TaskManagers and 
1 JobManager replica, with HA enabled.
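
For context, a minimal sketch of the kind of FlinkDeployment manifest we use is 
below. The name, image, resource values, and HA storage path are illustrative 
placeholders rather than our actual spec, and the exact config keys may differ 
slightly from our deployment:

{code:yaml}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ioi-quality                 # illustrative; matches the pod names below
spec:
  image: <our-flink-1.19.1-job-image>
  flinkVersion: v1_19
  mode: native                      # native Kubernetes application mode
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
    high-availability.type: kubernetes
    high-availability.storageDir: <ha-storage-dir>
  jobManager:
    replicas: 1
    resource:
      memory: "2048m"               # illustrative resource values
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/<job>.jar
    parallelism: 24                 # with 1 slot per TM this yields 24 TaskManagers
    upgradeMode: last-state
{code}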

Below is a chronological summary of events:


1. The job was initially started with 24 TaskManagers.

2. The JM pod was OOMKilled. This is confirmed by our kube-state-metrics (KSM) 
metrics, and {{kubectl describe pod <JM pod>}} also shows that the pod restarted 
due to an OOM kill (see the pod status sketch after this list).

3. After the OOM kill, the JM was restarted and 24 new TaskManager pods were 
started; this is confirmed by the available task slots shown in the Flink UI.

4. There was no impact on the job (it restarted successfully), but 48 
TaskManagers are now running, of which 24 are standby. The expected behaviour 
after a JM OOM with HA enabled is that no new TaskManagers are started.

5. I have attached a screenshot of the Flink UI showing the 24 extra TMs 
(48 task slots) and included the kubectl output below.
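
For reference, this is roughly where the OOM kill shows up in the pod status 
(output of {{kubectl get pod <JM pod> -o yaml}}); the container name and values 
here are an illustrative sketch, not copied from our cluster:

{code:yaml}
status:
  containerStatuses:
  - name: flink-main-container      # illustrative container name
    restartCount: 1
    lastState:
      terminated:
        reason: OOMKilled           # this is what confirms the OOM kill
        exitCode: 137
        finishedAt: "..."
{code}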

I also checked the Kubernetes operator pod logs and could not find anything that 
explains this behaviour. This has happened a few times now with different jobs. 
We have tried deliberately OOM-killing the JobManager of one of our test jobs 
many times, but we have not been able to reproduce the behaviour. It looks like 
an edge case that is difficult to reproduce.

Can you please help us figure out how to debug this, as the Kubernetes operator 
does not show any relevant information about why this happened? Thanks, and let 
me know if you need further information.




kubectl get pod output showing 24 extra TMs:
{code}
NAME                          READY   STATUS    RESTARTS      AGE
oi-quality-667f575877-btfkv   1/1     Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-10  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-11  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-12  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-13  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-14  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-15  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-16  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-17  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-18  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-19  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-2   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-20  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-21  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-22  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-23  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-24  1/1     Running   0             4d16h
ioi-quality-taskmanager-1-3   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-4   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-5   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-6   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-7   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-8   1/1     Running   0             4d16h
ioi-quality-taskmanager-1-9   1/1     Running   0             4d16h
ioi-quality-taskmanager-2-1   1/1     Running   0             39h
ioi-quality-taskmanager-2-10  1/1     Running   0             39h
ioi-quality-taskmanager-2-11  1/1     Running   0             39h
ioi-quality-taskmanager-2-12  1/1     Running   0             39h
ioi-quality-taskmanager-2-13  1/1     Running   0             39h
ioi-quality-taskmanager-2-14  1/1     Running   0             39h
ioi-quality-taskmanager-2-15  1/1     Running   0             39h
ioi-quality-taskmanager-2-16  1/1     Running   0             39h
ioi-quality-taskmanager-2-17  1/1     Running   0             39h
ioi-quality-taskmanager-2-18  1/1     Running   0             39h
ioi-quality-taskmanager-2-19  1/1     Running   0             39h
ioi-quality-taskmanager-2-2   1/1     Running   0             39h
ioi-quality-taskmanager-2-20  1/1     Running   0             39h
ioi-quality-taskmanager-2-21  1/1     Running   0             39h
ioi-quality-taskmanager-2-22  1/1     Running   0             39h
ioi-quality-taskmanager-2-23  1/1     Running   0             39h
ioi-quality-taskmanager-2-24  1/1     Running   0             39h
ioi-quality-taskmanager-2-3   1/1     Running   0             39h
ioi-quality-taskmanager-2-4   1/1     Running   0             39h
ioi-quality-taskmanager-2-5   1/1     Running   0             39h
ioi-quality-taskmanager-2-6   1/1     Running   0             39h
ioi-quality-taskmanager-2-7   1/1     Running   0             39h
ioi-quality-taskmanager-2-8   1/1     Running   0             39h
ioi-quality-taskmanager-2-9   1/1     Running   0             39h
{code}
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
