yuqian90 edited a comment on issue #7935:
URL: https://github.com/apache/airflow/issues/7935#issuecomment-839656436


   Hi @ashb, I would like to report that we've been seeing something similar to this issue in Airflow 2.0.2 recently.
   
   We are using Airflow 2.0.2 with a single airflow-scheduler and a few airflow-workers, using the CeleryExecutor and a Postgres backend, running dozens of DAGs each with hundreds to a few thousand tasks. The Python version is 3.8.7.
   
   Here's what we saw:
   The airflow-scheduler sometimes stops heartbeating and stops scheduling any tasks. This seems to happen at random times, about once or twice a week. When it happens, the last lines in the scheduler log are the following, i.e. it stops writing out any log after `receiving signal 15`. I straced the airflow scheduler process; the trace did not capture any other process sending it signal 15, so most likely the signal 15 was sent by the scheduler to itself.
   
   ```
   May 11 21:19:56 airflow[12643]: [2021-05-11 21:19:56,908] {base_executor.py:82} INFO - Adding to queue: ['airflow', 'tasks', 'run', ...]
   May 11 21:19:56 airflow[12643]: [2021-05-11 21:19:56,973] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
   ```
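   
   For context on where that last line comes from: the scheduler registers a SIGTERM handler that logs this message and exits. The sketch below is illustrative only, not the actual `scheduler_job.py` code; the point is that the handler only records that a SIGTERM (signal 15) was delivered, it cannot tell you which process sent it.
   
   ```
   # Illustrative sketch, not the real scheduler_job.py: a SIGTERM handler along
   # these lines is what produces "Exiting gracefully upon receiving signal 15".
   import logging
   import signal
   import sys
   
   log = logging.getLogger(__name__)
   
   def _exit_gracefully(signum, frame):
       # Logs whichever signal number arrived (SIGTERM == 15) and exits.
       # Nothing here identifies the sender of the signal.
       log.info("Exiting gracefully upon receiving signal %s", signum)
       sys.exit(0)
   
   signal.signal(signal.SIGTERM, _exit_gracefully)
   signal.signal(signal.SIGINT, _exit_gracefully)
   ```
   
   Since a child `airflow scheduler` process is forked from the main one (the default start method on Linux), it inherits the same handlers, so the same log line can also be produced by a child process receiving SIGTERM.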
   
   When the scheduler was in this state, `ps` also showed a child `airflow scheduler` process that had been spawned by the main `airflow scheduler` process. I forgot to run `py-spy dump`, but I did use `py-spy top` to look at the child `airflow scheduler` process; the output is below. It seems to be stuck somewhere in `celery_executor.py::_send_tasks_to_celery`. This sounds similar to what @milton0825 reported previously, although he mentioned he was using Airflow 1.10.8.
   
   When I manually sent SIGTERM to the child airflow scheduler process, it died, and the main `airflow scheduler` immediately started to heartbeat and schedule tasks again as if nothing had ever happened. So I suspect the `airflow scheduler` got stuck somewhere while spawning a child process. But I still don't understand how it produced an `Exiting gracefully upon receiving signal 15` line in the log.
   
   ```
   Total Samples 7859
   GIL: 0.00%, Active: 0.00%, Threads: 1
   
     %Own   %Total  OwnTime  TotalTime  Function (filename:line)
     0.00%   0.00%   0.540s    0.540s   __enter__ (multiprocessing/synchronize.py:95)
     0.00%   0.00%   0.000s    0.540s   worker (multiprocessing/pool.py:114)
     0.00%   0.00%   0.000s    0.540s   _bootstrap (multiprocessing/process.py:315)
     0.00%   0.00%   0.000s    0.540s   _repopulate_pool (multiprocessing/pool.py:303)
     0.00%   0.00%   0.000s    0.540s   main (airflow/__main__.py:40)
     0.00%   0.00%   0.000s    0.540s   start (multiprocessing/process.py:121)
     0.00%   0.00%   0.000s    0.540s   _send_tasks_to_celery (airflow/executors/celery_executor.py:330)
     0.00%   0.00%   0.000s    0.540s   Pool (multiprocessing/context.py:119)
     0.00%   0.00%   0.000s    0.540s   run (airflow/jobs/base_job.py:237)
     0.00%   0.00%   0.000s    0.540s   _repopulate_pool_static (multiprocessing/pool.py:326)
     0.00%   0.00%   0.000s    0.540s   heartbeat (airflow/executors/base_executor.py:158)
     0.00%   0.00%   0.000s    0.540s   _launch (multiprocessing/popen_fork.py:75)
     0.00%   0.00%   0.000s    0.540s   wrapper (airflow/utils/cli.py:89)
     0.00%   0.00%   0.000s    0.540s   __init__ (multiprocessing/pool.py:212)
     0.00%   0.00%   0.000s    0.540s   _Popen (multiprocessing/context.py:277)
   ```
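   
   One common way to end up with a pool worker stuck like this (blocked in `__enter__` of a multiprocessing lock right after `Pool(...)` forked it inside `_send_tasks_to_celery`) is the classic fork-safety hazard: if the process forks while some other thread happens to hold a lock, the child inherits the lock in its locked state, and the thread that would release it does not exist in the child, so the child blocks forever. The snippet below is a minimal, generic reproduction of that pattern, not Airflow code; the names and timings are made up for illustration.
   
   ```
   # Generic illustration of the fork-while-a-lock-is-held hazard (not Airflow code).
   import os
   import threading
   import time
   
   lock = threading.Lock()
   
   def hold_lock():
       # A background thread holds the lock around the moment of the fork.
       with lock:
           time.sleep(5)
   
   threading.Thread(target=hold_lock, daemon=True).start()
   time.sleep(0.1)      # make it likely that the fork happens while the lock is held
   
   pid = os.fork()      # multiprocessing.Pool forks the same way on Linux by default
   if pid == 0:
       # Child: only the forking thread survives, but the lock was copied in its
       # locked state, so this acquire blocks forever.
       with lock:
           print("child acquired the lock")   # never reached
       os._exit(0)
   else:
       print(f"forked child {pid}; it will hang trying to acquire the lock")
       os.waitpid(pid, 0)   # the parent waits here indefinitely, like the stuck scheduler
   ```
   
   If something along these lines is what happened, it would also be consistent with the main scheduler recovering as soon as the stuck child was killed.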
   
   One other observation: while the airflow scheduler was in this stuck state, the `DagFileProcessor` processes it had started were still running, and I could see them writing logs to `dag_processor_manager.log`.
   
    

