supritiy opened a new issue #22350:
URL: https://github.com/apache/airflow/issues/22350
### Apache Airflow version
2.2.4 (latest released)
### What happened
We recently upgraded from Airflow 1.10.15 to 2.2.4. Our stack runs on GKE,
using CeleryExecutor and Postgres 13 as the result backend and Airflow database.
Worker, webserver and scheduler processes run in their own containers. Whenever a
worker pod terminates and sends SIGTERM to the airflow celery worker process,
some long running tasks (mostly sensors) are left in a `running` state. I
understand that Celery tries to shut down gracefully, but the tasks can't
complete within the limited grace time Kubernetes allows. Is there a way to kill
these tasks immediately? Since these are sensor tasks, we don't mind them
failing.
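The grace time in question is the worker pod's terminationGracePeriodSeconds. Purely
as an illustration (this is a hypothetical manifest fragment, not our actual
deployment spec), the relevant part of the worker Deployment looks roughly like this:
```
# Hypothetical worker Deployment fragment; names and values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-worker
spec:
  template:
    spec:
      # Kubernetes sends SIGTERM on pod termination, then SIGKILL once this
      # window expires; long-running sensor tasks rarely finish in time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: worker
          args: ["airflow", "celery", "worker"]
```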
After the pod restarts, the scheduler detects these tasks as `zombies`.
The DAG processor logs `Executed failure callback for task <UP_FOR_RETRY>`,
but the task instance state remains `running` in the DB, so the task is not
retried and keeps being detected as a zombie until it is manually marked
failed/success or cleared.
### What you think should happen instead
Tasks should either terminate when sent SIGTERM, or tasks detected by the
scheduler as zombies should be correctly marked as UP_FOR_RETRY / FAILED.
### How to reproduce
Set job_heartbeat_sec (e.g. 3000 seconds) to a value longer than
scheduler_zombie_task_threshold (e.g. 5 seconds). The scheduler will detect the
running tasks as zombies, but their state in the DB remains unchanged, so the
scheduler keeps re-detecting them as zombie jobs. A minimal config sketch is
shown below.
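For reference, a minimal airflow.cfg sketch using the example values above (both
settings live under `[scheduler]`; the equivalent environment variables are noted
in comments, and the numbers are only the illustrative ones from this report):
```
[scheduler]
# Heartbeat interval for running task jobs
# (env: AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC)
job_heartbeat_sec = 3000

# How long a task may go without a heartbeat before the scheduler
# treats it as a zombie
# (env: AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD)
scheduler_zombie_task_threshold = 5
```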
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.0.0
apache-airflow-providers-celery==2.1.0
apache-airflow-providers-datadog==2.0.1
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-google==6.4.0
apache-airflow-providers-http==2.0.3
apache-airflow-providers-imap==2.2.0
apache-airflow-providers-postgres==3.0.0
apache-airflow-providers-sqlite==2.1.0
### Deployment
Other Docker-based deployment
### Deployment details
K8S running on GKE.
DB: Postgres 13
Result backend: Postgres 13
Broker: Redis
### Anything else
This problem always happens on pod termination. Logs from DAG processor:
```
[2022-03-17 12:21:11,145] {{processor.py:572}} DEBUG - Processing Callback
Request: {'full_filepath': '/Users/*/projects/airflow/dags/app_installs.py',
'msg': 'Detected <TaskInstance: app_installs.daily_ios_export_sensor
scheduled__2022-03-16T12:00:00+00:00 [running]> as zombie',
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object
at 0x10a2b4730>, 'is_failure_callback': True}
[2022-03-17 12:21:11,165] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,165] {{plugins_manager.py:287}} DEBUG - Loading plugins
[2022-03-17 12:21:11,165] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,165] {{plugins_manager.py:231}} DEBUG - Loading plugins from
directory: /Users/supritiyeotikar/projects/airflow/plugins
[2022-03-17 12:21:11,166] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,166] {{plugins_manager.py:246}} DEBUG - Importing plugin module
/Users/supritiyeotikar/projects/airflow/plugins/ssl_everything.py
[2022-03-17 12:21:11,166] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,166] {{plugins_manager.py:211}} DEBUG - Loading plugins from
entrypoints
[2022-03-17 12:21:11,181] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,180] {{plugins_manager.py:303}} DEBUG - Loading 1 plugin(s) took 0.02
seconds
[2022-03-17 12:21:11,181] {{logging_mixin.py:109}} INFO - [2022-03-17
12:21:11,181] {{plugins_manager.py:445}} DEBUG - Integrate DAG plugins
[2022-03-17 12:21:11,190] {{processor.py:608}} INFO - Executed failure
callback for <TaskInstance: app_installs.daily_ios_export_sensor
scheduled__2022-03-16T12:00:00+00:00 [up_for_retry]> in state up_for_retry
```
Logs on Scheduler:
```
[2022-03-17 12:21:15,320] {{manager.py:1065}} INFO - Finding 'running' jobs
without a recent heartbeat
[2022-03-17 12:21:15,321] {{manager.py:1069}} INFO - Failing jobs without
heartbeat after 2022-03-17 19:21:05.321936+00:00
[2022-03-17 12:21:15,333] {{manager.py:1092}} INFO - Detected zombie job:
{'full_filepath': '/Users/*/projects/airflow/dags/app_installs.py', 'msg':
'Detected <TaskInstance: app_installs.daily_ios_export_sensor
scheduled__2022-03-16T12:00:00+00:00 [running]> as zombie',
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object
at 0x1090ba6a0>, 'is_failure_callback': True}
[2022-03-17 12:21:15,935] {{settings.py:222}} DEBUG - Setting up DB
connection pool (PID 16883)
[2022-03-17 12:21:15,935] {{settings.py:311}} DEBUG -
settings.prepare_engine_args(): Using pool settings. pool_size=10,
max_overflow=10, pool_recycle=3600, pid=16883
[2022-03-17 12:21:15,970] {{cli_action_loggers.py:40}} DEBUG - Adding
<function default_action_log at 0x1125ba820> to pre execution callback
[2022-03-17 12:21:16,690] {{settings.py:222}} DEBUG - Setting up DB
connection pool (PID 16893)
```
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)