supritiy opened a new issue #22350:
URL: https://github.com/apache/airflow/issues/22350


   ### Apache Airflow version
   
   2.2.4 (latest released)
   
   ### What happened
   
   We recently upgraded from Airflow 1.10.15 to 2.2.4. Our stack runs on GKE, using CeleryExecutor and Postgres 13 as both the result backend and the Airflow metadata database. The worker, webserver and scheduler processes each run in their own containers. Whenever a worker pod terminates and sends SIGTERM to the airflow celery worker process, some long-running tasks (mostly sensors) are left in a `running` state. I understand that Celery tries to shut down gracefully, but the tasks can't complete within the limited grace period Kubernetes allows. Is there a way to kill these tasks immediately? Since these are sensor tasks, we don't mind them failing.
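   
   For context, here is a minimal sketch of the kind of long-running sensor that ends up stuck; the DAG id, task id and timings below are placeholders, not our actual DAG:
   
   ```python
   # Illustrative only: placeholder DAG/task ids and timings.
   from datetime import datetime
   
   from airflow import DAG
   from airflow.sensors.python import PythonSensor
   
   with DAG(
       dag_id="example_long_running_sensor",
       start_date=datetime(2022, 3, 1),
       schedule_interval="@daily",
       catchup=False,
   ) as dag:
       # A sensor that keeps poking for hours; if the worker pod is terminated
       # mid-poke, this task instance is what ends up stuck in `running`.
       PythonSensor(
           task_id="daily_export_sensor",
           python_callable=lambda: False,  # never true, so the sensor keeps waiting
           poke_interval=60,
           timeout=6 * 60 * 60,
           mode="poke",
           retries=3,
       )
   ```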
   
   After the pod restarts, the scheduler detects these tasks as `zombies`. There are logs in the DAG processor that say `Executed failure callback for task <UP_FOR_RETRY>`, but the task instance state remains `running` in the DB, so the task is never retried and keeps being detected as a zombie until it is manually marked failed/success or cleared.
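   
   For now the only workaround we see is to flip the stuck task instances manually against the metadata DB. A rough sketch of that (the helper name and arguments are hypothetical, and this is obviously not the fix we are asking for):
   
   ```python
   # Workaround sketch only: manually fail task instances stuck in `running`
   # after their worker pod is gone, so the scheduler stops re-detecting them
   # as zombies. Names and arguments are placeholders.
   from airflow.models import TaskInstance
   from airflow.utils.session import provide_session
   from airflow.utils.state import State
   
   
   @provide_session
   def fail_stuck_running_tis(dag_id, task_id, session=None):
       stuck = (
           session.query(TaskInstance)
           .filter(
               TaskInstance.dag_id == dag_id,
               TaskInstance.task_id == task_id,
               TaskInstance.state == State.RUNNING,
           )
           .all()
       )
       for ti in stuck:
           ti.state = State.FAILED  # or State.UP_FOR_RETRY to let it retry
           session.merge(ti)
       return len(stuck)
   ```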
   
   ### What you think should happen instead
   
   Tasks should either terminate when sent SIGTERM, or tasks detected by the scheduler as zombies should be correctly marked as UP_FOR_RETRY / FAILED.
   
   ### How to reproduce
   
   Set `job_heartbeat_sec` (e.g. 3000 sec) much longer than `scheduler_zombie_task_threshold` (e.g. 5 sec); a config sketch is shown below. The scheduler will detect running tasks as zombies, but their state in the DB remains unchanged, so the scheduler keeps detecting them as zombie jobs on every pass.
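   
   An airflow.cfg fragment for the above (the values are only meant to exaggerate the problem, not sensible production settings):
   
   ```
   [scheduler]
   # Heartbeats arrive far less often than the zombie threshold, so running
   # tasks are flagged as zombies almost immediately.
   job_heartbeat_sec = 3000
   scheduler_zombie_task_threshold = 5
   ```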
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==3.0.0
   apache-airflow-providers-celery==2.1.0
   apache-airflow-providers-datadog==2.0.1
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-google==6.4.0
   apache-airflow-providers-http==2.0.3
   apache-airflow-providers-imap==2.2.0
   apache-airflow-providers-postgres==3.0.0
   apache-airflow-providers-sqlite==2.1.0
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   K8S running on GKE. 
   DB: Postgres 13
   Result backend: Postgres 13
   Broker: Redis
   
   
   ### Anything else
   
   This problem always happens on pod termination. Logs from DAG processor:
   
   ```
   [2022-03-17 12:21:11,145] {{processor.py:572}} DEBUG - Processing Callback 
Request: {'full_filepath': '/Users/*/projects/airflow/dags/app_installs.py', 
'msg': 'Detected <TaskInstance: app_installs.daily_ios_export_sensor 
scheduled__2022-03-16T12:00:00+00:00 [running]> as zombie', 
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object 
at 0x10a2b4730>, 'is_failure_callback': True}
   [2022-03-17 12:21:11,165] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,165] {{plugins_manager.py:287}} DEBUG - Loading plugins
   [2022-03-17 12:21:11,165] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,165] {{plugins_manager.py:231}} DEBUG - Loading plugins from 
directory: /Users/supritiyeotikar/projects/airflow/plugins
   [2022-03-17 12:21:11,166] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,166] {{plugins_manager.py:246}} DEBUG - Importing plugin module 
/Users/supritiyeotikar/projects/airflow/plugins/ssl_everything.py
   [2022-03-17 12:21:11,166] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,166] {{plugins_manager.py:211}} DEBUG - Loading plugins from 
entrypoints
   [2022-03-17 12:21:11,181] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,180] {{plugins_manager.py:303}} DEBUG - Loading 1 plugin(s) took 0.02 
seconds
   [2022-03-17 12:21:11,181] {{logging_mixin.py:109}} INFO - [2022-03-17 
12:21:11,181] {{plugins_manager.py:445}} DEBUG - Integrate DAG plugins
   [2022-03-17 12:21:11,190] {{processor.py:608}} INFO - Executed failure 
callback for <TaskInstance: app_installs.daily_ios_export_sensor 
scheduled__2022-03-16T12:00:00+00:00 [up_for_retry]> in state up_for_retry
   ```
   
   Logs on Scheduler:
   
   ```
   [2022-03-17 12:21:15,320] {{manager.py:1065}} INFO - Finding 'running' jobs 
without a recent heartbeat
   [2022-03-17 12:21:15,321] {{manager.py:1069}} INFO - Failing jobs without 
heartbeat after 2022-03-17 19:21:05.321936+00:00
   [2022-03-17 12:21:15,333] {{manager.py:1092}} INFO - Detected zombie job: 
{'full_filepath': '/Users/*/projects/airflow/dags/app_installs.py', 'msg': 
'Detected <TaskInstance: app_installs.daily_ios_export_sensor 
scheduled__2022-03-16T12:00:00+00:00 [running]> as zombie', 
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object 
at 0x1090ba6a0>, 'is_failure_callback': True}
   [2022-03-17 12:21:15,935] {{settings.py:222}} DEBUG - Setting up DB 
connection pool (PID 16883)
   [2022-03-17 12:21:15,935] {{settings.py:311}} DEBUG - 
settings.prepare_engine_args(): Using pool settings. pool_size=10, 
max_overflow=10, pool_recycle=3600, pid=16883
   [2022-03-17 12:21:15,970] {{cli_action_loggers.py:40}} DEBUG - Adding 
<function default_action_log at 0x1125ba820> to pre execution callback
   [2022-03-17 12:21:16,690] {{settings.py:222}} DEBUG - Setting up DB 
connection pool (PID 16893)
   ```
   
   
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

