paramjeet01 commented on issue #39096:
URL: https://github.com/apache/airflow/issues/39096#issuecomment-2118726874

   I believe I have identified the cause of the issue:
   
   We are using AWS Spot EC2 instances for the workloads in Airflow. When a 
spot instance is terminated, the pod enters a terminating state for around 2 
minutes. During the second retry, the pod is rescheduled, and the 
[find_pod](https://github.com/apache/airflow/blob/2.8.3/airflow/providers/cncf/kubernetes/operators/pod.py#L535)
 method is used to retrieve the pod based on the labels, which results in the 
following error:
   ```
   [2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no 
attribute 'metadata'
   Traceback (most recent call last):
     File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 
153, in execute
       self.remote_pod = self.find_pod(self.pod.metadata.namespace, 
context=context)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py",
 line 523, in find_pod
       raise AirflowException(f"More than one pod running with labels 
{label_selector}")
   airflow.exceptions.AirflowException: More than one pod running with labels 
{**** our labels *****}
   ```
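One way to avoid the duplicate match would be to ignore pods that are already terminating when counting matches. The sketch below is not Airflow's code: `Metadata` and `Pod` are stand-ins for the Kubernetes client objects, and `pick_live_pod` is a hypothetical helper. It relies only on the fact that Kubernetes sets `deletionTimestamp` in a pod's metadata once termination starts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Metadata:
    # Stand-in for the Kubernetes client's V1ObjectMeta (only the fields used here).
    name: str
    deletion_timestamp: Optional[datetime] = None


@dataclass
class Pod:
    # Stand-in for V1Pod.
    metadata: Metadata


def pick_live_pod(pods: List[Pod]) -> Optional[Pod]:
    """Hypothetical helper: return the single non-terminating pod, or None."""
    # A pod that is being deleted has metadata.deletion_timestamp set.
    live = [p for p in pods if p.metadata.deletion_timestamp is None]
    if len(live) > 1:
        names = [p.metadata.name for p in live]
        raise RuntimeError(f"More than one live pod: {names}")
    return live[0] if live else None


# The old pod is still terminating on the reclaimed spot node; the retry created a new one.
terminating = Pod(Metadata("task-abc", deletion_timestamp=datetime.now(timezone.utc)))
retry = Pod(Metadata("task-def"))

selected = pick_live_pod([terminating, retry])
print(selected.metadata.name)
```

With this filter, the terminating pod no longer counts toward the "more than one pod" check, while two genuinely live pods would still raise.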
   At this point, there are two pods with the same labels: the old one, still 
terminating, and the new one created by the second retry. When the 
[cleanup](https://github.com/apache/airflow/blob/2.8.3/airflow/providers/cncf/kubernetes/operators/pod.py#L633)
 method is called, it hits a second error because the find_pod method raised 
instead of returning a pod, so the pod passed on for cleanup is None:
   ```
   During handling of the above exception, another exception occurred:
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py",
 line 937, in patch_already_checked
       name=pod.metadata.name,
   AttributeError: 'NoneType' object has no attribute 'metadata'
   ```
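The second traceback comes from dereferencing a pod that was never found. A minimal sketch of a guard for that path follows, assuming the cleanup step can simply be skipped when no pod is available. `cleanup_if_found` and the `patch` callback are hypothetical names for illustration, not part of the Airflow provider.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def cleanup_if_found(pod: Optional[Any], patch: Callable[[Any], None]) -> bool:
    """Hypothetical guard: only patch when a pod was actually found."""
    if pod is None:
        # The earlier lookup failed, so there is nothing to patch. Skipping here
        # keeps the original exception from being masked by an AttributeError.
        return False
    patch(pod)
    return True


patched = []


def record(pod: Any) -> None:
    # Stand-in for the real patch call; it just records the pod name.
    patched.append(pod.metadata.name)


print(cleanup_if_found(None, record))
found = SimpleNamespace(metadata=SimpleNamespace(name="task-def"))
print(cleanup_if_found(found, record))
```

A guard like this does not remove the orphaned pod by itself, but it keeps the first, more informative exception as the one reported in the task log.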
   After every retry, a new pod is created but never cleaned up, so the cycle 
repeats indefinitely.


