1fanwang opened a new issue, #66799:
URL: https://github.com/apache/airflow/issues/66799

   ### Description
   
   `KubernetesExecutor`'s pod create / patch / delete calls in 
`providers/cncf/kubernetes/.../kubernetes_executor_utils.py` go straight to the 
Kubernetes API client without emitting any metrics. When the upstream apiserver 
is slow (rate limiting, etcd contention, network latency), operators see the 
scheduler stalling but can't tell whether the bottleneck is the Airflow 
scheduler loop, the executor's queue, or the Kubernetes API itself.
   
   ### Use case / motivation
   
   Today, troubleshooting a slow `KubernetesExecutor` deployment requires 
correlating Airflow scheduler logs against the apiserver's own metrics, and even 
then you don't see per-status-code distributions (200 vs 429 vs 503) for each 
operation.
   
   ### Proposal
   
   Add three timer metrics and three status-code counters around the existing 
Kubernetes API call sites:
   
   | Metric | Type | Wraps |
   |---|---|---|
   | `executor.pod_creation` | timer | `_create_pod` (or equivalent create call) |
   | `executor.pod_deletion` | timer | `delete_pod` |
   | `executor.pod_patching` | timer | `patch_namespaced_pod` |
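   | `executor.pod_creation_status` | counter, tagged by status code | same call site as `executor.pod_creation` |
   | `executor.pod_deletion_status` | counter, tagged by status code | same call site as `executor.pod_deletion` |
   | `executor.pod_patching_status` | counter, tagged by status code | same call site as `executor.pod_patching` |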
   
   All additive. No behavioral change. Provider PR.
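   
   A minimal sketch of how one timer plus one status-code counter could wrap a 
call site. Everything here is illustrative, not the provider's actual code: 
`FakeStats`, `FakeApiException`, and `observe_k8s_call` are hypothetical names, 
and `FakeApiException` only stands in for the Kubernetes client's `ApiException` 
(which carries a `.status` attribute). Recording a successful call as `"200"` is 
a simplification, since the plain client call does not return the HTTP status.

```python
import time
from collections import Counter
from contextlib import contextmanager


class FakeStats:
    """Hypothetical in-memory stand-in for a StatsD-style metrics client."""

    def __init__(self):
        self.timings = {}          # metric name -> list of durations (seconds)
        self.counters = Counter()  # (metric name, status code) -> count

    def timing(self, name, seconds):
        self.timings.setdefault(name, []).append(seconds)

    def incr(self, name, tags):
        self.counters[(name, tags["status_code"])] += 1


class FakeApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (has a .status)."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


@contextmanager
def observe_k8s_call(stats, operation):
    """Time the wrapped call and count its outcome by status code.

    `operation` is one of "creation", "deletion", "patching", matching the
    metric names proposed in the table above.
    """
    start = time.monotonic()
    status = "unknown"  # kept if a non-API exception escapes the call
    try:
        yield
        status = "200"  # assumption: any successful call is recorded as 200
    except FakeApiException as exc:
        status = str(exc.status)
        raise  # purely additive: the original exception still propagates
    finally:
        stats.timing(f"executor.pod_{operation}", time.monotonic() - start)
        stats.incr(
            f"executor.pod_{operation}_status", tags={"status_code": status}
        )


stats = FakeStats()

# A successful call; the `pass` stands in for the real create call.
with observe_k8s_call(stats, "creation"):
    pass

# A rate-limited call: the exception is counted, then re-raised unchanged.
try:
    with observe_k8s_call(stats, "creation"):
        raise FakeApiException(429)
except FakeApiException:
    pass
```

   Emitting both metrics in the `finally` block means failed calls are timed as 
well, which is what lets a slow 429 or 503 show up in the latency distribution.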
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's Code of Conduct
   

