tanvn opened a new issue, #39088: URL: https://github.com/apache/airflow/issues/39088
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.5.0

### What happened?

I am running Airflow 2.5.0 with the Kubernetes executor. Recently, I enabled [the health check server of the scheduler](https://airflow.apache.org/docs/apache-airflow/2.5.0/logging-monitoring/check-health.html#scheduler-health-check-server) and configured a [blackbox exporter](https://github.com/prometheus/blackbox_exporter) that sends a request every 6-7 seconds to check whether the scheduler is healthy.

Normally, everything works fine. However, when a new deployment is rolled out, the old scheduler is terminated and a new one is created, and I found that many running task instances are cleared, i.e. the worker pods are terminated and the tasks are put up for retry (some of them are heavy tasks, so this is quite bad for us). If I disable the blackbox exporter (i.e. stop sending GET requests to the scheduler's health server), the issue does not happen: no task instances are cleared, they are just adopted. So I suspect there is something wrong with the logic that determines which task instances should be cleared instead of adopted.

Log from the new scheduler:

```
[2024-04-17T08:14:24.568+0000] {scheduler_job.py:1463} INFO - Reset the following 102 orphaned TaskInstances:
```

Log from a worker pod:

```
[2024-04-17, 16:56:29 JST] {local_task_job.py:223} WARNING - State of this instance has been externally set to restarting. Terminating instance.
[2024-04-17, 16:56:29 JST] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 92. PIDs of all processes in the group: [350, 92]
[2024-04-17, 16:56:29 JST] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 92
[2024-04-17, 16:56:29 JST] {taskinstance.py:1483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-17, 16:56:29 JST] {taskinstance.py:1772} ERROR - Task failed with exception
Traceback (most recent call last):
...
  File "/usr/local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 1485, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
```

### What you think should happen instead?

I expect the running task instances to be adopted correctly by the new scheduler so that the tasks can continue without being interrupted.

### How to reproduce

As described above:

- Deploy with the Helm chart and have many running task instances
- Set up an exporter that sends a GET request to the scheduler's health endpoint every 6-7 seconds (a minimal sketch of such a probe is at the end of this report)
- No HA mode, just a single scheduler

### Operating System

CentOS 7.9

### Versions of Apache Airflow Providers

2.5.0

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

- helm-chart version 1.7.0
- a single scheduler

`scheduler` section configuration:

```
enable_health_check: true
scheduler_health_check_server_port: 8974
job_heartbeat_sec: 45
scheduler_heartbeat_sec: 30
scheduler_health_check_threshold: 90
```

### Anything else?

The issue does not happen every time a new deployment is rolled out; so far it has happened on 3 of 5 rollouts (60%).

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
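For reference, here is a minimal Python sketch of what the blackbox exporter is effectively doing against the scheduler's health check server in my setup. The host name and the 6-second interval are placeholders from my environment; the port matches `scheduler_health_check_server_port` above, and I am assuming the default `/health` path described in the docs:

```python
import time

import requests

# Placeholder hostname for the scheduler's health check server in my setup;
# port 8974 matches scheduler_health_check_server_port in the config above.
HEALTH_URL = "http://airflow-scheduler:8974/health"


def probe_forever(interval_seconds: float = 6.0) -> None:
    """Poll the scheduler health endpoint, roughly like the blackbox exporter does."""
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            print(f"{time.strftime('%X')} -> HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"{time.strftime('%X')} -> probe failed: {exc}")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    probe_forever()
```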
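And here is my rough understanding of the check behind the "Reset the following 102 orphaned TaskInstances" line above: when the new scheduler looks for orphans, task instances whose owning scheduler job no longer counts as alive become candidates for reset instead of adoption. This is only a simplified sketch based on the `scheduler_health_check_threshold` value in my config, not the actual Airflow code:

```python
from datetime import datetime, timedelta, timezone

# Value from the scheduler configuration shown above.
SCHEDULER_HEALTH_CHECK_THRESHOLD = timedelta(seconds=90)


def scheduler_job_is_alive(latest_heartbeat: datetime) -> bool:
    """Simplified: a scheduler job whose last heartbeat is older than the
    threshold is treated as dead, and its running/queued task instances
    become candidates for reset instead of adoption."""
    return datetime.now(timezone.utc) - latest_heartbeat < SCHEDULER_HEALTH_CHECK_THRESHOLD


# Example: a heartbeat from 2 minutes ago would make the old scheduler look
# dead, so its task instances would be reset rather than adopted.
old_heartbeat = datetime.now(timezone.utc) - timedelta(minutes=2)
print(scheduler_job_is_alive(old_heartbeat))  # False
```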
