moonyoung created FLINK-38821:
---------------------------------

             Summary: Add TaskManager creation/deletion metrics to Active 
Resource Manager
                 Key: FLINK-38821
                 URL: https://issues.apache.org/jira/browse/FLINK-38821
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics
         Environment: Flink 1.15, Java 8, Kubernetes 1.30.12
            Reporter: moonyoung


h2. Background / Problem Description

We observed an issue in a Kubernetes-based Flink deployment where 
{*}TaskManagers are repeatedly created and deleted in a short period of time{*}.

This behavior appears as a rapid loop of TaskManager pod creation and 
termination. Based on our investigation, this is likely caused by a 
Kubernetes-related issue rather than Flink application logic.

During this incident, TaskManagers failed to register with the JobManager and 
exited early with the following error:

The environment variable _POD_NODE_ID is not set, which is used to identify the 
node where the task manager is located.

This error indicates that the TaskManager could not obtain the node name from 
the Kubernetes Downward API.

As a result:
 * The TaskManager process terminates during initialization

 * The TaskManager never registers with the JobManager

 * The Active Resource Manager treats the worker as “pending but 
unregistered”

This strongly suggests a transient or intermittent Kubernetes issue (e.g., 
Downward API failure, pod startup race, or node-related instability).

h2. Suggestion

Currently, ActiveResourceManager only maintains {{pendingWorkerCounter}} and 
{{totalWorkerCounter}}. These are {_}gauge-like state metrics{_}: they describe 
the current worker counts, but they do not reflect:
 * How frequently TaskManagers are being created

 * How frequently TaskManagers are being removed or failing before registration

In high-churn scenarios, this makes it difficult to:
 * Detect instability caused by Kubernetes

 * Distinguish between “healthy but busy” and “unhealthy and flapping” clusters

 * Perform root cause analysis based on metrics alone
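
To illustrate this blind spot, here is a minimal, self-contained simulation. It 
is not Flink code: the class and method names are hypothetical, and it assumes 
that every worker that exits before registering is replaced by a new request 
before the next metrics scrape.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a gauge-like "pending workers" value; not Flink code.
class GaugeBlindSpot {

    /** Returns the pending-worker value observed at the end of each scrape interval. */
    static List<Integer> scrapePendingGauge(int churnCyclesPerInterval, int intervals) {
        int pending = 1; // one worker requested and waiting to register
        List<Integer> samples = new ArrayList<>();
        for (int i = 0; i < intervals; i++) {
            for (int c = 0; c < churnCyclesPerInterval; c++) {
                pending--; // worker exits before registering
                pending++; // assumed: a replacement is requested right away
            }
            samples.add(pending); // the only value a metrics reporter would see
        }
        return samples;
    }

    public static void main(String[] args) {
        System.out.println(scrapePendingGauge(0, 3));  // stable cluster
        System.out.println(scrapePendingGauge(20, 3)); // flapping cluster
    }
}
```

Both calls produce the same samples, so under these assumptions a dashboard 
built on the gauge alone cannot tell the stable case from the flapping one.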

So I propose adding *counter-based metrics* to the Active Resource Manager, 
such as:
 * *TaskManager creation counter*

 ** Incremented whenever a TaskManager is requested / launched

 * *TaskManager removal (or failure) counter*

 ** Incremented whenever a pending or running TaskManager is removed before 
normal shutdown

These metrics would complement the existing gauge-like metrics and provide 
*temporal visibility* into TaskManager churn.
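
A rough sketch of what the two counters proposed above could look like is shown 
below. It is plain Java, not a patch: {{WorkerChurnMetrics}}, its method names, 
and the hook points are hypothetical. In Flink itself the counters would 
presumably be registered on the resource manager's metric group rather than 
held in a standalone class.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch; class and method names are illustrative, not Flink API.
class WorkerChurnMetrics {
    private final LongAdder created = new LongAdder();
    private final LongAdder removed = new LongAdder();

    /** Assumed hook: called whenever a TaskManager is requested / launched. */
    void onWorkerRequested() { created.increment(); }

    /** Assumed hook: called whenever a pending or running TaskManager is removed. */
    void onWorkerRemoved() { removed.increment(); }

    long createdTotal() { return created.sum(); }

    long removedTotal() { return removed.sum(); }

    /** Derives events per minute from two samples of a monotonic counter. */
    static double ratePerMinute(long earlier, long later, long elapsedSeconds) {
        return (later - earlier) * 60.0 / elapsedSeconds;
    }

    public static void main(String[] args) {
        WorkerChurnMetrics metrics = new WorkerChurnMetrics();
        long before = metrics.createdTotal();
        for (int i = 0; i < 20; i++) { // 20 create/remove cycles
            metrics.onWorkerRequested();
            metrics.onWorkerRemoved();
        }
        long after = metrics.createdTotal();
        // Assume the two samples were taken 120 seconds apart.
        System.out.println("creations per minute: " + ratePerMinute(before, after, 120));
        System.out.println("removed total: " + metrics.removedTotal());
    }
}
```

Because the counters only ever increase, a monitoring system can compute a rate 
from any two samples, which is the signal that the gauge-like metrics cannot 
provide.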



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
