Nihar Rao created FLINK-37773:
---------------------------------
Summary: Extra TMs are started when JobManager is OOM killed in
some FlinkDeployment runs
Key: FLINK-37773
URL: https://issues.apache.org/jira/browse/FLINK-37773
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.10.0
Reporter: Nihar Rao
Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png
Hi,
We are running into an unexpected issue with Apache Flink Kubernetes Operator 1.10.0
and Apache Flink 1.19.1. We run jobs in native Kubernetes
application mode using the FlinkDeployment CRD. The affected job has 24
TaskManagers and 1 JobManager replica, with HA enabled.
Below is a chronological summary of events:
1. The job was initially started with 24 TaskManagers.
2. The JM pod was OOMKilled. This is confirmed by our KSM metrics, and {{kubectl
describe pod <JM pod>}} also shows that the pod restarted due to OOM.
3. After the OOM kill, the JM was restarted and 24 new TaskManager pods were
started. This is confirmed by the available task slots section of the Flink UI.
4. There was no impact on the job (it restarted successfully), but there are now 48
TaskManagers running, 24 of which are standby. The expected behaviour after
a JM OOM with HA enabled is that no new TaskManagers are started.
5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48 task
slots), and the kubectl output is included below.
I also checked the Kubernetes operator pod logs and did not find anything that
could explain this behaviour. This has happened a few times now with different
jobs. We have tried purposely OOM-killing the JobManager of one of our test jobs
many times, but we haven't been able to reproduce this behaviour. It looks to be an
edge case that is difficult to reproduce.
Can you please help us debug this? The Kubernetes operator logs don't show
any relevant information on why this happened. Thanks, and let me know if you
need further information.
kubectl get pod output showing 24 extra TMs:
oi-quality-667f575877-btfkv     1/1   Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-10    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-11    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-12    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-13    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-14    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-15    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-16    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-17    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-18    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-19    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-2     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-20    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-21    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-22    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-23    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-24    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-3     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-4     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-5     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-6     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-7     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-8     1/1   Running   0             4d16h
ioi-quality-taskmanager-1-9     1/1   Running   0             4d16h
ioi-quality-taskmanager-2-1     1/1   Running   0             39h
ioi-quality-taskmanager-2-10    1/1   Running   0             39h
ioi-quality-taskmanager-2-11    1/1   Running   0             39h
ioi-quality-taskmanager-2-12    1/1   Running   0             39h
ioi-quality-taskmanager-2-13    1/1   Running   0             39h
ioi-quality-taskmanager-2-14    1/1   Running   0             39h
ioi-quality-taskmanager-2-15    1/1   Running   0             39h
ioi-quality-taskmanager-2-16    1/1   Running   0             39h
ioi-quality-taskmanager-2-17    1/1   Running   0             39h
ioi-quality-taskmanager-2-18    1/1   Running   0             39h
ioi-quality-taskmanager-2-19    1/1   Running   0             39h
ioi-quality-taskmanager-2-2     1/1   Running   0             39h
ioi-quality-taskmanager-2-20    1/1   Running   0             39h
ioi-quality-taskmanager-2-21    1/1   Running   0             39h
ioi-quality-taskmanager-2-22    1/1   Running   0             39h
ioi-quality-taskmanager-2-23    1/1   Running   0             39h
ioi-quality-taskmanager-2-24    1/1   Running   0             39h
ioi-quality-taskmanager-2-3     1/1   Running   0             39h
ioi-quality-taskmanager-2-4     1/1   Running   0             39h
ioi-quality-taskmanager-2-5     1/1   Running   0             39h
ioi-quality-taskmanager-2-6     1/1   Running   0             39h
ioi-quality-taskmanager-2-7     1/1   Running   0             39h
ioi-quality-taskmanager-2-8     1/1   Running   0             39h
ioi-quality-taskmanager-2-9     1/1   Running   0             39h
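
To make the duplication in the listing above easier to spot on other jobs, here is a minimal Python sketch that groups TaskManager pod names by the first number after "taskmanager". It assumes the pod names follow the pattern <cluster-id>-taskmanager-<attempt>-<index>, which matches the names in this listing; reading the first number as a JobManager attempt counter is our interpretation, not something confirmed by the operator logs. The function names are hypothetical helpers, and the script only reports which attempts still have pods. It cannot tell which set of TaskManagers is actually holding the job's slots.

```python
import re
from collections import defaultdict

# Assumed pod-name pattern: <cluster-id>-taskmanager-<attempt>-<index>
TM_POD = re.compile(r"^(?P<cluster>.+)-taskmanager-(?P<attempt>\d+)-(?P<index>\d+)$")


def group_taskmanagers_by_attempt(pod_names):
    """Map attempt number -> sorted list of TaskManager indices."""
    groups = defaultdict(list)
    for name in pod_names:
        match = TM_POD.match(name.strip())
        if match:  # the JobManager pod and unrelated pods do not match
            groups[int(match.group("attempt"))].append(int(match.group("index")))
    return {attempt: sorted(indices) for attempt, indices in sorted(groups.items())}


def older_attempts(groups):
    """Attempts other than the newest one that still have running pods."""
    if not groups:
        return []
    newest = max(groups)
    return [attempt for attempt in groups if attempt != newest]


if __name__ == "__main__":
    # Pod names copied from the kubectl listing in this report.
    pods = ["oi-quality-667f575877-btfkv"]
    pods += [f"ioi-quality-taskmanager-1-{i}" for i in range(1, 25)]
    pods += [f"ioi-quality-taskmanager-2-{i}" for i in range(1, 25)]

    groups = group_taskmanagers_by_attempt(pods)
    for attempt, indices in groups.items():
        print(f"attempt {attempt}: {len(indices)} TaskManager pods")
    print("older attempts still running:", older_attempts(groups))
```

In practice the pod names could be fed in from {{kubectl get pod -o name}} after stripping the "pod/" prefix. More than one attempt with running pods would be the signal to look at.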