1fanwang opened a new pull request, #66806:
URL: https://github.com/apache/airflow/pull/66806

   ### Problem
   
   The `KubernetesExecutor` calls `create_namespaced_pod`, 
`delete_namespaced_pod`, and `patch_namespaced_pod` against the API server on 
every task lifecycle event, but emits no metrics around those calls. When a 
cluster's control plane is slow, throttling (HTTP 429), or returning 5xx, the 
only signal today is scheduler log noise — there's no way to alert on latency 
drift or error-rate spikes without scraping logs.
   
   ### Fix
   
   Wrap each of the three pod API call sites in `kubernetes_executor_utils.py` 
with `Stats.timer` for latency (`kubernetes_executor.pod_creation` / 
`pod_deletion` / `pod_patching`) and a paired `Stats.incr` tagged by status 
(`pod_creation_status` / `pod_deletion_status` / `pod_patching_status`). The 
counter is tagged `status="200"` on success and with the `ApiException.status` 
value on failure, so operators can chart per-status-code rates. The 404-is-fine 
branch in `delete_pod` and the swallow-on-failure branches in the two patch 
methods still behave as before — they just emit a counter on the way out.
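
A minimal sketch of the pattern described above, for `delete_pod` only. It is not the PR's actual diff: `RecordingStats` and the local `ApiException` are hypothetical stand-ins for Airflow's `Stats` and the Kubernetes client's exception so the snippet runs without either library, and the full counter name `kubernetes_executor.pod_deletion_status` assumes the counter shares the timer's prefix.

```python
import time
from contextlib import contextmanager


class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (only `status` is modeled)."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class RecordingStats:
    """Hypothetical stand-in for Airflow's Stats; records calls instead of emitting."""

    def __init__(self):
        self.timers = []
        self.counters = []

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even when the wrapped call raises.
            self.timers.append((name, time.perf_counter() - start))

    def incr(self, name, tags=None):
        self.counters.append((name, dict(tags or {})))


def delete_pod(api_call, stats):
    """Time the API call, then count it by status code; a 404 is tolerated."""
    status = "200"
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            api_call()
    except ApiException as e:
        status = str(e.status)
        if e.status != 404:
            raise
    finally:
        # Runs on success, on the swallowed 404, and on re-raised errors.
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": status})
```

Emitting the counter from `finally` keeps a single emission point for all three outcomes, which is one way to get the "emit a counter on the way out" behavior without changing which exceptions propagate.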
   
   The three new timers and three new counters are registered in 
`shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml`
 so they pass the metrics-registry pre-commit hook and show up in the published 
metrics docs.
   
   ### Tests
   
   New unit tests in `test_kubernetes_executor.py` mock the `Stats` module and 
assert the timer + tagged counter fire on both the success path and an 
`ApiException(status=429)` failure path for `delete_pod`.
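
The shape of such a test might look roughly like the following. This is an illustrative sketch, not the test code from the PR: the `delete_pod` function here is a hypothetical, simplified call site, `ApiException` is a local stand-in, and the counter name again assumes the `kubernetes_executor.` prefix.

```python
from unittest import mock


class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def delete_pod(kube_client, stats, name, namespace):
    """Hypothetical instrumented call site, shaped after the Fix section."""
    status = "200"
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            kube_client.delete_namespaced_pod(name, namespace)
    except ApiException as e:
        status = str(e.status)
        if e.status != 404:
            raise
    finally:
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": status})


def check_delete_pod_metrics(side_effect, expected_status):
    """Run delete_pod against mocks and assert the timer and tagged counter fired."""
    kube_client, stats = mock.MagicMock(), mock.MagicMock()
    kube_client.delete_namespaced_pod.side_effect = side_effect
    try:
        delete_pod(kube_client, stats, "pod-1", "default")
    except ApiException:
        pass  # the failure path re-raises; only the metrics are checked here
    stats.timer.assert_called_once_with("kubernetes_executor.pod_deletion")
    stats.incr.assert_called_once_with(
        "kubernetes_executor.pod_deletion_status", tags={"status": expected_status}
    )
    return True
```

Calling `check_delete_pod_metrics(None, "200")` exercises the success path, and `check_delete_pod_metrics(ApiException(status=429), "429")` exercises the throttled path.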
   
   Closes #66799
   

