florian-meyrueis-al opened a new issue, #60527:
URL: https://github.com/apache/airflow/issues/60527

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   10.5.0
   
   Presumably all later versions are affected as well.
   
   ### Apache Airflow version
   
   2.11.0. 3.x was not tested, but since this is not an Airflow core issue, it is probably affected too.
   
   ### Operating System
   
   Debian
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   The scheduler logs show a series of errors like this:
   
   
   ```
   2026-01-11 19:11:33.092 | [2026-01-11T19:11:33.091+0000] {kubernetes_executor_utils.py:98} ERROR - Unknown error in KubernetesJobWatcher. Failing
   2026-01-11 19:11:33.092 | Traceback (most recent call last):
   2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
   2026-01-11 19:11:33.092 |     self.resource_version = self._run(
   2026-01-11 19:11:33.092 |                             ^^^^^^^^^^
   2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
   2026-01-11 19:11:33.092 |     self.process_status(
   2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
   2026-01-11 19:11:33.092 |     container_status_state["waiting"]["reason"]
   2026-01-11 19:11:33.092 |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
   2026-01-11 19:11:33.092 | KeyError: 'reason'
   2026-01-11 19:11:33.093 | Process KubernetesJobWatcher-3:
   2026-01-11 19:11:33.093 | Traceback (most recent call last):
   2026-01-11 19:11:33.093 |   File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
   2026-01-11 19:11:33.093 |     self.run()
   2026-01-11 19:11:33.093 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
   2026-01-11 19:11:33.094 |     self.resource_version = self._run(
   2026-01-11 19:11:33.094 |                             ^^^^^^^^^^
   2026-01-11 19:11:33.094 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
   2026-01-11 19:11:33.094 |     self.process_status(
   2026-01-11 19:11:33.094 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
   2026-01-11 19:11:33.094 |     container_status_state["waiting"]["reason"]
   2026-01-11 19:11:33.094 |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
   2026-01-11 19:11:33.094 | KeyError: 'reason'
   ```
   
   Around that time, most of our DAGs start and then run through the night.
   
   In the morning, our monitoring of available slots showed this:
   
   <img width="803" height="260" alt="Image" src="https://github.com/user-attachments/assets/541877af-f0bc-4166-a63e-60e0a39bc72a" />
   
   The blue line is the number of open execution slots and the green line is the number of running execution slots.
   
   The problem is that at that time (08:00 and later), no DAGs were running on Airflow anymore. Our only solution was to restart the scheduler to make all our open slots available again.
   
   ### What you think should happen instead
   
   The KubernetesJobWatcher should not crash because a key is missing from the Kubernetes API response, and all open slots should have been released properly when the DAGs finished.
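
To illustrate the expected behaviour, here is a rough sketch of a processing loop that logs and skips a malformed event instead of letting the whole process die. This is not the provider's code: `process_events` and `handle_event` are hypothetical names, and the events are assumed to be plain dicts.

```python
import logging

log = logging.getLogger(__name__)


def process_events(events, handle_event):
    """Apply handle_event to each event; log and skip events that raise.

    Hypothetical helper for illustration only, not part of the provider.
    Returns a (processed, skipped) tuple of counts.
    """
    processed = 0
    skipped = 0
    for event in events:
        try:
            handle_event(event)
            processed += 1
        except (KeyError, TypeError):
            # A single malformed event should not stop the loop.
            log.exception("Skipping malformed pod event: %r", event)
            skipped += 1
    return processed, skipped
```

Whether skipping is the right policy is a separate question; the point is only that one unexpected payload does not have to end the watcher.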
   
   ### How to reproduce
   
   I don't know.
   
   ### Anything else
   
   The Kubernetes provider code should correctly handle optional keys in Kubernetes API responses.
   In this case, the Kubernetes API does not mark the "reason" and "message" keys as required in the specification of the ContainerStateWaiting object.
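
A minimal sketch of the kind of defensive access meant here, assuming the container state is a plain dict shaped like the `container_status_state` in the traceback above. The helper name `waiting_reason_and_message` is hypothetical and does not exist in the provider.

```python
from __future__ import annotations

from typing import Any


def waiting_reason_and_message(
    container_status_state: dict[str, Any],
) -> tuple[str | None, str | None]:
    """Return (reason, message) for a waiting container.

    Both keys are treated as optional: a missing key yields None
    instead of raising KeyError. Hypothetical helper for illustration.
    """
    # "waiting" itself may be absent or None when the container is not waiting.
    waiting = container_status_state.get("waiting") or {}
    return waiting.get("reason"), waiting.get("message")
```

With this approach, a state such as `{"waiting": {}}` returns `(None, None)` and the caller can decide how to treat an unknown reason.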
   
   
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
