yuqian90 commented on issue #7935:
URL: https://github.com/apache/airflow/issues/7935#issuecomment-839656436
Hi @ashb, I would like to report that we've been seeing something similar to
this issue in Airflow 2.0.2 recently.
We are running Airflow 2.0.2 with a single airflow-scheduler plus a few
airflow-worker instances, using the CeleryExecutor and a Postgres backend, with dozens of DAGs
each containing hundreds to a few thousand tasks. The Python version is 3.8.7.
Here's what we saw:
The airflow-scheduler sometimes stops heartbeating and stops scheduling any
tasks. This seems to happen at random times, roughly once or twice a week. When
this happens, the scheduler log ends with the lines below, i.e. it stops writing
any log output after `receiving signal 15`. I ran strace on the
airflow scheduler process, and it did not capture any other process sending it
signal 15, so most likely the signal 15 was sent by the scheduler to itself.
```
May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,908] {base_executor.py:82} INFO - Adding to queue: ['airflow', 'tasks', 'run', ...]
May 11 21:19:56 shaops43 airflow[12643]: [2021-05-11 21:19:56,973] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
```
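For context on that last line: the scheduler installs a SIGTERM handler that logs this message and exits. A minimal sketch of that pattern (the handler name and logger setup here are illustrative, not the actual `scheduler_job.py` code) looks like this:

```python
# Minimal sketch of the signal-handling pattern behind the log line above;
# names are illustrative, not Airflow's actual implementation.
import logging
import signal
import sys

log = logging.getLogger(__name__)

def _exit_gracefully(signum, frame):
    # Produces a line like "Exiting gracefully upon receiving signal 15".
    log.info("Exiting gracefully upon receiving signal %s", signum)
    sys.exit(0)

# Registered at scheduler startup; note that with fork-based multiprocessing,
# forked children inherit whatever handlers the parent had installed.
signal.signal(signal.SIGTERM, _exit_gracefully)
signal.signal(signal.SIGINT, _exit_gracefully)
```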
When the scheduler was in this state, `ps` also showed a child `airflow
scheduler` process that had been spawned by the main `airflow scheduler`
process. I forgot to capture a `py-spy dump`, but I did run `py-spy top`
against the child `airflow scheduler` process; the output is included below. It appears to be
stuck somewhere in `celery_executor.py::_send_tasks_to_celery`. This sounds
similar to what @milton0825 reported previously, although he mentioned he was
using Airflow 1.10.8.
When I manually sent SIGTERM to the child `airflow scheduler` process, it died, and
the main `airflow scheduler` immediately started heartbeating and scheduling
tasks again as if nothing had happened. So I suspect the
`airflow scheduler` got stuck somewhere while spawning a child process. But I still
don't understand how it produced an `Exiting gracefully upon receiving signal
15` line in the log.
```
Total Samples 7859
GIL: 0.00%, Active: 0.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  0.00%   0.00%   0.540s    0.540s   __enter__ (multiprocessing/synchronize.py:95)
  0.00%   0.00%   0.000s    0.540s   worker (multiprocessing/pool.py:114)
  0.00%   0.00%   0.000s    0.540s   _bootstrap (multiprocessing/process.py:315)
  0.00%   0.00%   0.000s    0.540s   _repopulate_pool (multiprocessing/pool.py:303)
  0.00%   0.00%   0.000s    0.540s   main (airflow/__main__.py:40)
  0.00%   0.00%   0.000s    0.540s   start (multiprocessing/process.py:121)
  0.00%   0.00%   0.000s    0.540s   _send_tasks_to_celery (airflow/executors/celery_executor.py:330)
  0.00%   0.00%   0.000s    0.540s   Pool (multiprocessing/context.py:119)
  0.00%   0.00%   0.000s    0.540s   run (airflow/jobs/base_job.py:237)
  0.00%   0.00%   0.000s    0.540s   _repopulate_pool_static (multiprocessing/pool.py:326)
  0.00%   0.00%   0.000s    0.540s   heartbeat (airflow/executors/base_executor.py:158)
  0.00%   0.00%   0.000s    0.540s   _launch (multiprocessing/popen_fork.py:75)
  0.00%   0.00%   0.000s    0.540s   wrapper (airflow/utils/cli.py:89)
  0.00%   0.00%   0.000s    0.540s   __init__ (multiprocessing/pool.py:212)
  0.00%   0.00%   0.000s    0.540s   _Popen (multiprocessing/context.py:277)
```
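The top frame (`__enter__` in `multiprocessing/synchronize.py:95`) underneath `worker (multiprocessing/pool.py:114)` suggests this child is a pool worker that was forked while `_send_tasks_to_celery` was constructing its `multiprocessing.Pool`, and that it is blocked acquiring a lock. One classic way a freshly forked child ends up hanging like that is a fork happening while another thread holds a lock. The sketch below only illustrates that general failure mode, using plain `threading` and `os.fork`; it is not the actual Airflow/Celery code path:

```python
# Hedged sketch of the generic "fork while a lock is held" hang (POSIX-only).
# This illustrates the failure class, not Airflow's celery_executor internals.
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    # This thread owns the lock at the moment the main thread forks.
    with lock:
        time.sleep(3600)

threading.Thread(target=hold_lock, daemon=True).start()
time.sleep(0.1)  # make sure the lock is actually held before forking

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in its locked state, and the thread that owned
    # it was not copied by fork, so this acquire never returns.
    lock.acquire()
    print("never reached")
    os._exit(0)
else:
    print(f"forked child {pid}; it will hang forever acquiring the inherited lock")
```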
One other observation: while the airflow scheduler was in this stuck state, the
`DagFileProcessor` processes started by the scheduler were still running; I
could see them writing logs to `dag_processor_manager.log`.