rubanolha opened a new issue, #37041: URL: https://github.com/apache/airflow/issues/37041
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.7.3

### What happened?

Note: our executor is KubernetesExecutor, with task retries = 1.

I encountered an issue several times where a task was terminated externally at the beginning of execution, marked as failed, and not retried, despite having the retry parameter set to 1. The UI displays only one attempt, but upon further investigation I observed discrepancies between the logs, the Airflow database, and the metrics: there are two records in the `task_fail` table and two increments of the `airflow.scheduler.tasks.killed_externally` metric.

`public.task_fail`:

```
id,task_id,dag_id,start_date,end_date,duration,map_index,run_id
21007,xxx_task_id,,2024-01-26 09:08:19.937939,,-1,scheduled__2024-01-25T08:30:00+00:00
21008,xxx_task_id,,2024-01-26 09:08:54.288874,,-1,scheduled__2024-01-25T08:30:00+00:00
```

Surprisingly, there are no records in the `airflow.ti.finish` metric. The task was submitted once to the Kubernetes pod, as confirmed by both the Airflow logs and the Kubernetes logs. However, the Airflow logs contain the message "Was the task killed externally?" twice.

The task also has an `on_failure_callback`. A Slack message was received, which means the `on_failure_callback` was triggered.

If the task is terminated after it has started to post some logs, the logs contain "Task received SIGTERM signal" and "Marking task as UP_FOR_RETRY", and the failure is not tracked by the `airflow.scheduler.tasks.killed_externally` metric.

Airflow is deployed on Kubernetes, with a single scheduler pod using default configurations. There are four processes still running in the scheduler, identified as `python /home/airflow/.local/bin/airflow scheduler -n -1`.

Logs: [logs.csv](https://github.com/apache/airflow/files/14071250/logs.csv)

### What you think should happen instead?
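The expected retry behavior can be sketched as a simplified eligibility check. This is a hypothetical helper written for illustration, not Airflow's actual `TaskInstance` code: an externally killed attempt should count like any other failure and move the task to `UP_FOR_RETRY` while the retry budget lasts.

```python
# Hypothetical, simplified sketch of retry eligibility -- NOT Airflow's
# actual implementation. `try_number` is the attempt being finalized,
# `retries` is the task-level retry setting (1 in this report).
def should_retry(try_number: int, retries: int) -> bool:
    # An externally killed attempt should count like any other failure:
    # retry while the attempt number has not exceeded the retry budget.
    return retries > 0 and try_number <= retries


# With retries=1: the first (externally killed) attempt is retried,
# and only a second failure should mark the task as FAILED.
print(should_retry(try_number=1, retries=1))  # True  -> UP_FOR_RETRY
print(should_retry(try_number=2, retries=1))  # False -> FAILED
```

Under this model, the behavior reported above (one external kill, zero retries, immediate FAILED) is what should not happen.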
1. If a task is externally terminated, it should automatically retry based on the configured retry settings.
2. Even if a task is killed with SIGTERM, it should still be tracked by the `airflow.scheduler.tasks.killed_externally` metric.

### How to reproduce

1. Trigger a task run (reproducible both manually through the UI and when scheduled).
2. Find the pod created for this task with `kubectl get pods --watch`.
3. Immediately afterwards, kill the pod with `kubectl delete pod pod_name` (it should be killed before the task starts to print any logs).

Worker pod status from `kubectl get pods --watch`:

```
xxx_pod_name 0/1 ContainerCreating 0 0s
xxx_pod_name 1/1 Running           0 1s
xxx_pod_name 1/1 Terminating       0 6s
```

There should be no Kubernetes configuration in the pod template that affects graceful termination (such as `terminationGracePeriodSeconds: 0`).

### Operating System

```
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

### Versions of Apache Airflow Providers

- apache-airflow-providers-cncf-kubernetes = "^7.13.0"
- airflow:2.7.3-python3.11
- Kubernetes cluster version 1.28
- Helm chart: https://airflow-helm.github.io/charts v8.8.0

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Deployed with Helm: https://airflow-helm.github.io/charts v8.8.0

### Anything else?

Logs are attached in a file in the description. The task should be killed at the beginning, before it starts to post any logs.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
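As a side note on the attached `logs.csv`: the discrepancy (one UI attempt, but the "Was the task killed externally?" message twice) can be quantified with a small helper. This is a hypothetical snippet for inspecting the log dump, and the sample lines below are illustrative placeholders, not verbatim Airflow log output.

```python
# Hypothetical helper: count how many times the scheduler logged the
# "killed externally" question, to compare against the single attempt
# shown in the UI for the same task instance.
KILLED_EXTERNALLY_MSG = "Was the task killed externally?"


def count_external_kill_messages(log_text: str) -> int:
    return log_text.count(KILLED_EXTERNALLY_MSG)


# Illustrative placeholder lines standing in for the attached logs.csv.
sample_log = "\n".join([
    "... finished (failed) although the task says it is queued. "
    "Was the task killed externally?",
    "... finished (failed) although the task says it is queued. "
    "Was the task killed externally?",
])

print(count_external_kill_messages(sample_log))  # 2 messages for 1 attempt
```

A count greater than the number of attempts recorded in the UI matches the doubled `task_fail` rows and `killed_externally` metric increments described above.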
