paramjeet01 opened a new issue, #39096:
URL: https://github.com/apache/airflow/issues/39096
### Apache Airflow version
Other Airflow 2 version (please specify below)
### If "Other Airflow 2 version" selected, which one?
2.8.3
### What happened?
Airflow terminated the task with SIGTERM and scheduled it for a retry. However, the retry attempt failed with an error indicating that pods with identical labels still existed. On inspection, I found that the pods from the first attempt were still active.
**First attempt error log:**
```
[2024-04-18, 01:30:40 IST] {local_task_job_runner.py:121} ERROR - Received SIGTERM. Terminating subprocesses
[2024-04-18, 01:30:40 IST] {process_utils.py:131} INFO - Sending 15 to group 125. PIDs of all processes in the group: [125]
[2024-04-18, 01:30:40 IST] {process_utils.py:86} INFO - Sending the signal 15 to group 125
[2024-04-18, 01:30:40 IST] {taskinstance.py:2483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-18, 01:30:40 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 157, in execute
    self.pod_manager.fetch_requested_container_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 580, in fetch_requested_container_logs
    status = self.fetch_container_logs(pod=pod, container_name=c, follow=follow_logs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 503, in fetch_container_logs
    last_log_time, exc = consume_logs(since_time=last_log_time)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 450, in consume_logs
    for raw_line in logs:
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 240, in __iter__
    for data_chunk in self.response.stream(amt=None, decode_content=True):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1001, in _update_chunk_length
    line = self._fp.fp.readline() # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.10/ssl.py", line 1307, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1163, in read
    return self._sslobj.read(len, buffer)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 2485, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: Task received SIGTERM signal
[2024-04-18, 01:30:40 IST] {taskinstance.py:527} DEBUG - Task Duration set to 10.25225
[2024-04-18, 01:30:40 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:30:40 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY.
```
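The first traceback shows the SIGTERM handler in `taskinstance.py` raising `AirflowException("Task received SIGTERM signal")` from inside the blocking log-streaming read, so the exception unwinds through the operator's `execute`. The stdlib-only sketch below models that mechanism on a POSIX system. It assumes the handler calls the operator's `on_kill()` cleanup hook before raising (check this against your Airflow version); `FakeOperator`, `install_sigterm_handler`, `TaskReceivedSigterm`, and the pod name are hypothetical stand-ins, not Airflow or provider APIs.

```python
import os
import signal
import time


class TaskReceivedSigterm(Exception):
    """Hypothetical stand-in for the AirflowException raised on SIGTERM."""


class FakeOperator:
    """Hypothetical operator that remembers the pod it launched."""

    def __init__(self):
        self.pod_name = None
        self.deleted = []

    def on_kill(self):
        # Hypothetical cleanup: a real operator would delete the pod through
        # the Kubernetes API here; this sketch only records the name.
        if self.pod_name is not None:
            self.deleted.append(self.pod_name)

    def execute(self):
        self.pod_name = "example-pod"         # the pod has been "launched"
        os.kill(os.getpid(), signal.SIGTERM)  # simulate Airflow killing the task
        while True:                           # stand-in for blocking log streaming
            time.sleep(0.01)


def install_sigterm_handler(task):
    def signal_handler(signum, frame):
        # Assumed behaviour: run the cleanup hook, then raise so the exception
        # unwinds out of whatever blocking call was in progress.
        task.on_kill()
        raise TaskReceivedSigterm("Task received SIGTERM signal")

    # Signal handlers can only be installed from the main thread.
    signal.signal(signal.SIGTERM, signal_handler)


if __name__ == "__main__":
    op = FakeOperator()
    install_sigterm_handler(op)
    try:
        op.execute()
    except TaskReceivedSigterm as exc:
        print(exc, "| deleted:", op.deleted)
```

In this model the pod is cleaned up only because `pod_name` is already set when the signal arrives and `on_kill` knows how to use it. In a custom subclass, it may be worth checking that whatever attribute the cleanup hook reads is populated before the long-running log read starts.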
**Second attempt error log:**
```
[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: More than one pod running with labels {**** our labels *****}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
[2024-04-18, 01:32:20 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: {**** our labels *****}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 186, in execute
    self.cleanup(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 839, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod { *** our pod name ***} returned a failure.
remote_pod: None
[2024-04-18, 01:32:20 IST] {taskinstance.py:527} DEBUG - Task Duration set to 1.162314
[2024-04-18, 01:32:20 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:32:20 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY.
[2024-04-18, 01:32:20 IST] {plugins.py:178} INFO - Getting Refined Message
[2024-04-18, 01:32:20 IST] {plugins.py:180} INFO - Message Payload Not Provided
[2024-04-18, 01:32:20 IST] {logging_mixin.py:188} INFO - {'Airflow Exception': 'Pod Launching failed'}
[2024-04-18, 01:32:20 IST] {plugins.py:183} INFO - Task message: {'Airflow Exception': 'Pod Launching failed'}
```
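The second attempt fails inside `find_pod`, which looks up existing pods by label selector and raises when more than one pod matches. Below is a simplified, hypothetical model of that lookup: plain dicts instead of `V1Pod` objects, only `key=value` and `key!=value` selector terms, and not the provider's actual code. It assumes the selector identifies the task rather than the individual attempt (recent provider versions appear to leave the try number out of the lookup labels), so a pod that survived the killed attempt still matches on the retry. The label names and pod names are illustrative.

```python
from typing import Dict, List, Optional


class MultiplePodsError(Exception):
    """Hypothetical stand-in for the 'More than one pod running with labels' error."""


def labels_match(labels: Dict[str, str], selector: str) -> bool:
    # Supports only "key=value" and "key!=value" terms; real Kubernetes
    # selectors also allow set-based and existence terms, omitted here.
    for term in filter(None, selector.split(",")):
        if "!=" in term:
            key, value = term.split("!=", 1)
            if labels.get(key) == value:
                return False
        else:
            key, value = term.split("=", 1)
            if labels.get(key) != value:
                return False
    return True


def find_pod(pods: List[dict], selector: str) -> Optional[dict]:
    """Return the single matching pod, None if there is none, raise if several."""
    matches = [pod for pod in pods if labels_match(pod["labels"], selector)]
    if len(matches) > 1:
        raise MultiplePodsError(f"More than one pod running with labels {selector}")
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical pods carrying the same task labels.
    selector = "dag_id=my_dag,task_id=my_task,already_checked!=True"
    first = {"name": "my-task-pod-1", "labels": {"dag_id": "my_dag", "task_id": "my_task"}}
    second = {"name": "my-task-pod-2", "labels": {"dag_id": "my_dag", "task_id": "my_task"}}
    print(find_pod([first], selector)["name"])
    try:
        find_pod([first, second], selector)
    except MultiplePodsError as exc:
        print(exc)
```

A pod labelled `already_checked=True` is skipped by the `!=` term, which is why marking pods as checked keeps them out of later lookups; a pod that was never marked keeps matching.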
### What you think should happen instead?
When the task receives SIGTERM and Airflow terminates its subprocesses, the operator should delete the pod it launched, so that the retry does not find a leftover pod with the same labels.
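Until the pod is reliably deleted on SIGTERM, one possible workaround is to delete pods left over from earlier attempts before launching a new one. The sketch below assumes a client exposing `list_namespaced_pod(namespace, label_selector=...)` and `delete_namespaced_pod(name, namespace)`, as the official `kubernetes` Python client's `CoreV1Api` does. The function `delete_leftover_pods`, the `FakeCoreV1` test double, and the selector values are hypothetical, and deleting pods this way discards whatever the old attempt was still doing.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def delete_leftover_pods(core_v1: Any, namespace: str, label_selector: str) -> List[str]:
    """Delete every pod in namespace matching label_selector; return their names.

    core_v1 is assumed to behave like kubernetes.client.CoreV1Api.
    """
    pod_list = core_v1.list_namespaced_pod(namespace, label_selector=label_selector)
    deleted = []
    for pod in pod_list.items:
        core_v1.delete_namespaced_pod(pod.metadata.name, namespace)
        deleted.append(pod.metadata.name)
    return deleted


class FakeCoreV1:
    """Hypothetical in-memory stand-in for CoreV1Api, for trying this without a cluster.

    Only equality selectors ("key=value,...") are understood.
    """

    def __init__(self, pods: Dict[str, Dict[str, str]]):
        self.pods = pods  # pod name -> labels

    def list_namespaced_pod(self, namespace: str, label_selector: str = "") -> SimpleNamespace:
        wanted = dict(term.split("=", 1) for term in label_selector.split(",") if term)
        items = [
            SimpleNamespace(metadata=SimpleNamespace(name=name))
            for name, labels in self.pods.items()
            if all(labels.get(k) == v for k, v in wanted.items())
        ]
        return SimpleNamespace(items=items)

    def delete_namespaced_pod(self, name: str, namespace: str) -> None:
        del self.pods[name]


if __name__ == "__main__":
    client = FakeCoreV1({
        "my-task-pod-1": {"dag_id": "my_dag", "task_id": "my_task"},
        "other-task-pod": {"dag_id": "my_dag", "task_id": "other"},
    })
    print(delete_leftover_pods(client, "default", "dag_id=my_dag,task_id=my_task"))
    print(sorted(client.pods))
```

Against a real cluster you would pass a `kubernetes.client.CoreV1Api()` instance (after loading a kube config) instead of `FakeCoreV1`. Note that this trades away `reattach_on_restart`: if the retry is supposed to reattach to a pod that is still doing useful work, deleting it first defeats that.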
### How to reproduce
Let Airflow kill a running task with SIGTERM. On the next retry, the task fails because a pod with the same labels already exists.
### Operating System
Amazon Linux 2
### Versions of Apache Airflow Providers
```
pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0
```
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Official helm chart deployment
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)