moonyoung created FLINK-38821:
---------------------------------
Summary: Add TaskManager creation/deletion metrics to Active
Resource Manager
Key: FLINK-38821
URL: https://issues.apache.org/jira/browse/FLINK-38821
Project: Flink
Issue Type: Improvement
Components: Runtime / Metrics
Environment: flink 1.15
java 8
k8s 1.30.12
Reporter: moonyoung
h2. Background / Problem Description
We observed an issue in a Kubernetes-based Flink deployment where
{*}TaskManagers are repeatedly created and deleted in a short period of time{*}.
This behavior appears as a rapid loop of TaskManager pod creation and
termination. Based on our investigation, this is likely caused by a
Kubernetes-related issue rather than Flink application logic.
During this incident, TaskManagers failed to register with the JobManager and
exited early with the following error:
The environment variable _POD_NODE_ID is not set, which is used to identify the
node where the task manager is located.
This error indicates that the TaskManager could not obtain the node name from
the Kubernetes Downward API.
As a result:
* The TaskManager process terminates during initialization
* The TaskManager never registers with the JobManager
* The Active Resource Manager considers the worker as “pending but
unregistered”
This strongly suggests a transient or intermittent Kubernetes issue (e.g.,
Downward API failure, pod startup race, or node-related instability).
h2. Suggestion
Currently, ActiveResourceManager only contains {{pendingWorkerCounter}} and
{{totalWorkerCounter}}, which are {_}gauge-like state metrics{_}. They do not reflect:
* How frequently TaskManagers are being created
* How frequently TaskManagers are being removed or failing before registration
In high-churn scenarios, this makes it difficult to:
* Detect instability caused by Kubernetes
* Distinguish between “healthy but busy” and “unhealthy and flapping” clusters
* Perform root cause analysis based on metrics alone
So I propose adding *counter-based metrics* to the Active Resource Manager,
such as:
* *TaskManager creation counter*
** Incremented whenever a TaskManager is requested / launched
* *TaskManager removal (or failure) counter*
** Incremented whenever a pending or running TaskManager is removed before
normal shutdown
These metrics would complement the existing gauge-like metrics and provide
*temporal visibility* into TaskManager churn.
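A minimal sketch of what such counters might look like, written in plain Java with no Flink dependencies. The class and method names ({{WorkerChurnCounters}}, {{onWorkerRequested}}, {{onWorkerRemoved}}) are hypothetical and are not part of Flink's API. Splitting removals by registration state is an assumption added here to make "failed before registration" visible; a single removal counter would also satisfy the proposal.

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch, NOT Flink's actual API: monotonically increasing
 * counters that record TaskManager churn, as opposed to gauge-like state.
 */
final class WorkerChurnCounters {
    // Total number of workers ever requested / launched.
    private final LongAdder created = new LongAdder();
    // Workers removed while still pending (never registered with the JobManager).
    private final LongAdder removedBeforeRegistration = new LongAdder();
    // Workers removed after they had registered (assumed split, see lead-in).
    private final LongAdder removedAfterRegistration = new LongAdder();

    /** Call whenever a new worker is requested from the cluster framework. */
    void onWorkerRequested() {
        created.increment();
    }

    /** Call whenever a worker is removed before a normal shutdown. */
    void onWorkerRemoved(boolean wasRegistered) {
        if (wasRegistered) {
            removedAfterRegistration.increment();
        } else {
            removedBeforeRegistration.increment();
        }
    }

    long getCreated() {
        return created.sum();
    }

    long getRemovedBeforeRegistration() {
        return removedBeforeRegistration.sum();
    }

    long getRemovedAfterRegistration() {
        return removedAfterRegistration.sum();
    }

    /** Total removals, regardless of registration state. */
    long getRemoved() {
        return removedBeforeRegistration.sum() + removedAfterRegistration.sum();
    }
}
```

Because these values only ever increase, a monitoring system can derive a rate from them. A creation rate that stays high while {{getRemovedBeforeRegistration()}} grows at a similar pace would point to the flapping pattern described above, even when the pending and total worker gauges look stable.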