Nihar Rao created FLINK-37773:
---------------------------------

             Summary: Extra TMs are started when JobManager is OOM killed in some FlinkDeployment runs
                 Key: FLINK-37773
                 URL: https://issues.apache.org/jira/browse/FLINK-37773
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.10.0
            Reporter: Nihar Rao
         Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png

Hi,

We are running into an unexpected issue with Apache Flink Kubernetes Operator 1.10.0 and Apache Flink 1.19.1. We run jobs in native Kubernetes application mode using the FlinkDeployment CRD. The affected job has 24 TaskManagers and 1 JobManager replica, with HA enabled.

Below is the chronological summary of events:

1. The job was initially started with 24 TaskManagers.
2. The JM pod was OOM-killed. This is confirmed by our KSM metrics, and {{kubectl describe pod <JM pod>}} also shows that the pod restarted due to OOM.
3. After the OOM kill, the JM was restarted and 24 new TaskManager pods were started. This is confirmed by the available task slots section of the Flink UI.
4. There was no impact on the job (it restarted successfully), but 48 TaskManagers are now running, 24 of which are standby. With HA enabled, the expected behaviour after a JM OOM is that no new TaskManagers are started.
5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48 task slots), and the kubectl output is included below.

I also checked the Kubernetes operator pod logs and did not find anything that could explain this behaviour. This has happened a few times now with different jobs. We have deliberately OOM-killed the JobManager of one of our test jobs many times, but we have not been able to reproduce the behaviour. It appears to be an edge case that is difficult to reproduce.

Could you please help us debug this? The Kubernetes operator does not show any relevant information about why this happened.

Thanks, and let me know if you need further information.

kubectl get pod output showing 24 extra TMs:

oi-quality-667f575877-btfkv    1/1   Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-10   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-11   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-12   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-13   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-14   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-15   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-16   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-17   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-18   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-19   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-2    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-20   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-21   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-22   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-23   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-24   1/1   Running   0             4d16h
ioi-quality-taskmanager-1-3    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-4    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-5    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-6    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-7    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-8    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-9    1/1   Running   0             4d16h
ioi-quality-taskmanager-2-1    1/1   Running   0             39h
ioi-quality-taskmanager-2-10   1/1   Running   0             39h
ioi-quality-taskmanager-2-11   1/1   Running   0             39h
ioi-quality-taskmanager-2-12   1/1   Running   0             39h
ioi-quality-taskmanager-2-13   1/1   Running   0             39h
ioi-quality-taskmanager-2-14   1/1   Running   0             39h
ioi-quality-taskmanager-2-15   1/1   Running   0             39h
ioi-quality-taskmanager-2-16   1/1   Running   0             39h
ioi-quality-taskmanager-2-17   1/1   Running   0             39h
ioi-quality-taskmanager-2-18   1/1   Running   0             39h
ioi-quality-taskmanager-2-19   1/1   Running   0             39h
ioi-quality-taskmanager-2-2    1/1   Running   0             39h
ioi-quality-taskmanager-2-20   1/1   Running   0             39h
ioi-quality-taskmanager-2-21   1/1   Running   0             39h
ioi-quality-taskmanager-2-22   1/1   Running   0             39h
ioi-quality-taskmanager-2-23   1/1   Running   0             39h
ioi-quality-taskmanager-2-24   1/1   Running   0             39h
ioi-quality-taskmanager-2-3    1/1   Running   0             39h
ioi-quality-taskmanager-2-4    1/1   Running   0             39h
ioi-quality-taskmanager-2-5    1/1   Running   0             39h
ioi-quality-taskmanager-2-6    1/1   Running   0             39h
ioi-quality-taskmanager-2-7    1/1   Running   0             39h
ioi-quality-taskmanager-2-8    1/1   Running   0             39h
ioi-quality-taskmanager-2-9    1/1   Running   0             39h

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
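
The duplicated TaskManager set in the pod listing of this report could be detected automatically with something like the sketch below. It assumes that the first number after {{taskmanager-}} in a pod name identifies the JobManager attempt that created the pod. That reading is inferred only from the listing, where the {{1-*}} pods are 4d16h old and the {{2-*}} pods are 39h old. The function names {{group_taskmanagers}} and {{stale_attempts}} are hypothetical, and the sample input is a shortened excerpt of the listing. The sketch only flags the condition; it does not explain why the older pods were not released.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred from the listing in this report:
#   <cluster-id>-taskmanager-<attempt>-<index>
TM_POD = re.compile(
    r"^(?P<cluster>.+)-taskmanager-(?P<attempt>\d+)-(?P<index>\d+)$"
)


def group_taskmanagers(pod_listing: str) -> dict[int, list[str]]:
    """Group TaskManager pod names by the attempt number in their name."""
    groups: dict[int, list[str]] = defaultdict(list)
    for line in pod_listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = TM_POD.match(fields[0])  # first column is the pod name
        if match:
            groups[int(match.group("attempt"))].append(fields[0])
    return dict(groups)


def stale_attempts(groups: dict[int, list[str]]) -> list[int]:
    """Return every attempt number except the newest one, in ascending order."""
    if not groups:
        return []
    newest = max(groups)
    return sorted(attempt for attempt in groups if attempt != newest)


if __name__ == "__main__":
    # Shortened excerpt of the `kubectl get pod` output from this report.
    sample = """\
oi-quality-667f575877-btfkv    1/1   Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1    1/1   Running   0             4d16h
ioi-quality-taskmanager-1-2    1/1   Running   0             4d16h
ioi-quality-taskmanager-2-1    1/1   Running   0             39h
ioi-quality-taskmanager-2-2    1/1   Running   0             39h
"""
    groups = group_taskmanagers(sample)
    for attempt, pods in sorted(groups.items()):
        print(f"attempt {attempt}: {len(pods)} TaskManager pods")
    print("older attempts still running:", stale_attempts(groups))
```

Feeding the full output of {{kubectl get pod}} into {{group_taskmanagers}} and alerting when {{stale_attempts}} is non-empty would be one way to catch the next occurrence close to the time it happens, when JobManager and operator logs are more likely to still be available.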