HsiuChuanHsu commented on code in PR #56690:
URL: https://github.com/apache/airflow/pull/56690#discussion_r2450343945


##########
airflow-core/docs/administration-and-deployment/logging-monitoring/metrics.rst:
##########
@@ -254,6 +254,8 @@ Name                                                 Description
 ``pool.scheduled_slots``                             Number of scheduled slots in the pool. Metric with pool_name tagging.
 ``pool.starving_tasks.<pool_name>``                  Number of starving tasks in the pool
 ``pool.starving_tasks``                              Number of starving tasks in the pool. Metric with pool_name tagging.
+``task.cpu_usage_percent.<dag_id>.<task_id>``        CPU usage percentage of a task

Review Comment:
   Thanks for all your feedback!
   IMO, when we have a large number of DAGs and tasks, we need to define the most efficient granularity for the monitoring data.
   
   For me, it would be more efficient to treat the task level as the finest-grained unit for core monitoring. The combination of `<dag_id>.<task_id>` provides the most practical and efficient level for routine data monitoring.
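   The proposed naming scheme can be illustrated with a tiny sketch (the DAG and task names below are made up for illustration, not taken from the PR):

```python
def task_level_metric(dag_id: str, task_id: str) -> str:
    """Build the task-level metric name from the proposed scheme."""
    return f"task.cpu_usage_percent.{dag_id}.{task_id}"

# One time series per (dag_id, task_id) pair, regardless of retries
# or mapped-task expansion.
print(task_level_metric("etl_daily", "load_orders"))
# task.cpu_usage_percent.etl_daily.load_orders
```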
   
   > Possibly reporting the stats on individual instances as gauge will produce 
a high cardinality statistics. Possibly the cardinality there is not "too high" 
if we do it per individual tis. But I am not sure.
   
   If we try to drill down to the `try_id` or `map_index` level, that will likely result in a very high-cardinality metric set.
   My perspective is that this level of detail would be too fine-grained and potentially inefficient for large-scale monitoring.
   But I'm not sure what others think.
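   A back-of-envelope comparison of the two granularities (the deployment numbers here are purely illustrative assumptions, not measurements):

```python
# Assumed deployment size -- illustrative only, not from a real installation.
n_dags, tasks_per_dag = 500, 20
map_indexes_per_task, tries_per_ti = 100, 3

# Task-level keying: one series per (dag_id, task_id) pair.
task_level = n_dags * tasks_per_dag

# Instance-level keying: multiply by map_index and try_number as well.
instance_level = task_level * map_indexes_per_task * tries_per_ti

print(task_level, instance_level)  # 10000 vs 3000000 series
```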
   
   > When we are using gauge, only the last one counts, and previous values are replaced by the following ones - so effectively what we have is the values in the last execution of the "primary key". Not sure what is the best approach here.
   
   I think recording the value from the last execution of the same primary key (`<dag_id>.<task_id>`) should be sufficient. Time-series monitoring tools (e.g., Prometheus) automatically collect each sample with a timestamp, so past data for a given primary key can still be traced back without any extra effort on our side.
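   A minimal simulation of the gauge semantics quoted above (the "scraper" here is a stand-in for a tool like Prometheus, not its real client API):

```python
class GaugeStore:
    """Last-write-wins store, like a statsd gauge backend."""

    def __init__(self):
        self.current = {}

    def set(self, key: str, value: float) -> None:
        self.current[key] = value  # previous value is replaced


store = GaugeStore()
history = []  # what a timestamped scraper would accumulate over time

# Three executions of the same (dag_id, task_id), scraped in between.
for cpu in (12.5, 80.0, 45.0):
    store.set("task.cpu_usage_percent.etl_daily.load_orders", cpu)
    history.append(dict(store.current))

# The gauge itself only holds the last value...
print(store.current["task.cpu_usage_percent.etl_daily.load_orders"])  # 45.0
# ...but the scrape history retains every sampled value.
print([h["task.cpu_usage_percent.etl_daily.load_orders"] for h in history])
```

   So even though the gauge overwrites on every push, the collector's timestamped samples preserve the history we care about.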



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
