tanvn opened a new issue, #39088: URL: https://github.com/apache/airflow/issues/39088
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.5.0

### What happened?

I am running Airflow 2.5.0 with the Kubernetes executor. Recently, I enabled [the health check server of the scheduler](https://airflow.apache.org/docs/apache-airflow/2.5.0/logging-monitoring/check-health.html#scheduler-health-check-server) and configured a [blackbox exporter](https://github.com/prometheus/blackbox_exporter) that sends a request every 6-7 seconds to check whether the scheduler is healthy.

Normally, everything works fine. However, when a new deployment is rolled out, the old scheduler is terminated and a new one is created, and I found that many running task instances are cleared, i.e. the worker pods are terminated and the tasks are put up for retry (some of them are heavy tasks, so this is quite bad for us). If I disable the blackbox exporter (i.e. stop sending GET requests to the scheduler's health server), the issue does not happen: no task instances are cleared, they are just adopted. So I suspect there is something wrong with the logic that determines which task instances should be cleared instead of adopted.

Log from the new scheduler:

```
[2024-04-17T08:14:24.568+0000] {scheduler_job.py:1463} INFO - Reset the following 102 orphaned TaskInstances:
```

Log from a worker pod:

```
[2024-04-17, 16:56:29 JST] {local_task_job.py:223} WARNING - State of this instance has been externally set to restarting. Terminating instance.
[2024-04-17, 16:56:29 JST] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 92. PIDs of all processes in the group: [350, 92]
[2024-04-17, 16:56:29 JST] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 92
[2024-04-17, 16:56:29 JST] {taskinstance.py:1483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-17, 16:56:29 JST] {taskinstance.py:1772} ERROR - Task failed with exception
Traceback (most recent call last):
...
  File "/usr/local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 1485, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
```

### What you think should happen instead?

I expect the running task instances to be adopted correctly by the new scheduler so that the tasks can continue without being interrupted.

### How to reproduce

As described above:

- Deploy with the Helm chart and have many running task instances
- Set up an exporter that sends a GET request to the scheduler's health endpoint every 6-7 seconds (a minimal sketch of such a probe is at the end of this report)
- No HA mode, just a single scheduler

### Operating System

CentOS 7.9

### Versions of Apache Airflow Providers

2.5.0

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

- helm-chart version 1.7.0
- a single scheduler

`scheduler` section configuration:

```
enable_health_check: true
scheduler_health_check_server_port: 8974
job_heartbeat_sec: 45
scheduler_heartbeat_sec: 30
scheduler_health_check_threshold: 90
```

### Anything else?

The issue does not happen every time a new deployment is rolled out; so far it has happened on 3 of 5 rollouts (60%).

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
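For reference, here is a minimal Python sketch of what the blackbox exporter is effectively doing against the scheduler's health check server in my setup. The host name and the 6-second interval are placeholders from my environment; the port matches `scheduler_health_check_server_port` above, and I am assuming the default `/health` path described in the docs:

```python
import time

import requests

# Placeholder hostname for the scheduler's health check server in my setup;
# port 8974 matches scheduler_health_check_server_port in the config above.
HEALTH_URL = "http://airflow-scheduler:8974/health"


def probe_forever(interval_seconds: float = 6.0) -> None:
    """Poll the scheduler health endpoint, roughly like the blackbox exporter does."""
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            print(f"{time.strftime('%X')} -> HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"{time.strftime('%X')} -> probe failed: {exc}")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    probe_forever()
```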
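And here is my rough understanding of the check behind the "Reset the following 102 orphaned TaskInstances" line above: when the new scheduler looks for orphans, task instances whose owning scheduler job no longer counts as alive become candidates for reset instead of adoption. This is only a simplified sketch based on the `scheduler_health_check_threshold` value in my config, not the actual Airflow code:

```python
from datetime import datetime, timedelta, timezone

# Value from the scheduler configuration shown above.
SCHEDULER_HEALTH_CHECK_THRESHOLD = timedelta(seconds=90)


def scheduler_job_is_alive(latest_heartbeat: datetime) -> bool:
    """Simplified: a scheduler job whose last heartbeat is older than the
    threshold is treated as dead, and its running/queued task instances
    become candidates for reset instead of adoption."""
    return datetime.now(timezone.utc) - latest_heartbeat < SCHEDULER_HEALTH_CHECK_THRESHOLD


# Example: a heartbeat from 2 minutes ago would make the old scheduler look
# dead, so its task instances would be reset rather than adopted.
old_heartbeat = datetime.now(timezone.utc) - timedelta(minutes=2)
print(scheduler_job_is_alive(old_heartbeat))  # False
```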
