[ https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986044#comment-16986044 ]

Brent Driskill commented on AIRFLOW-6134:
-----------------------------------------

Digging into it a little more, I see the following:

The scheduler and the workers communicate status through the “job” table in the 
Postgres DB. While it is running, the scheduler inserts a row for itself with 
job_type “SchedulerJob”, updates that row's “latest_heartbeat” every 5 seconds, 
and on shutdown moves it from the ‘running’ state to ‘success’.
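For anyone else following along, here is a minimal sketch of how we are watching 
that row from a Python shell on the scheduler host (assuming the 1.10.x ORM 
layout, where BaseJob maps the “job” table):

{code:python}
# Minimal sketch, assuming the 1.10.x ORM layout where BaseJob maps the "job"
# table; run from a Python shell on a host with a working airflow.cfg.
from airflow.jobs.base_job import BaseJob
from airflow.utils.db import create_session

with create_session() as session:
    jobs = (
        session.query(BaseJob)
        .filter(BaseJob.job_type == "SchedulerJob")
        .order_by(BaseJob.latest_heartbeat.desc())
        .limit(5)
        .all()
    )
    for job in jobs:
        # A healthy scheduler bumps latest_heartbeat every ~5 seconds and
        # flips state from 'running' to 'success' on shutdown.
        print(job.id, job.state, job.latest_heartbeat)
{code}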

Then, when the scheduler triggers a DAG, it records the corresponding job in the 
DB as running and sets an initial heartbeat. The worker picks that job up and 
continually updates its latest_heartbeat while working on it. If you watch the 
table and the jobs on the worker (using py-spy), you will see that after a few 
minutes they hang on a “connect” call to Postgres and the latest_heartbeat is 
never updated again. The scheduler has a “scheduler_zombie_task_threshold”: if 
the heartbeat is not updated within that window (we have ours set to 6 minutes), 
the scheduler retries the job. Because of this, none of the jobs are completing 
and the scheduler retries them over and over.
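A rough sketch of the check that behaviour implies (this is not the scheduler's 
actual zombie-detection code path, just how we are spotting the stale rows 
ourselves):

{code:python}
# Rough sketch, not the scheduler's actual code: list running jobs whose
# latest_heartbeat is older than our 6-minute scheduler_zombie_task_threshold.
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.utils import timezone
from airflow.utils.db import create_session

ZOMBIE_THRESHOLD = timedelta(minutes=6)  # our scheduler_zombie_task_threshold

with create_session() as session:
    cutoff = timezone.utcnow() - ZOMBIE_THRESHOLD
    stale = (
        session.query(BaseJob)
        .filter(BaseJob.state == "running")
        .filter(BaseJob.latest_heartbeat < cutoff)
        .all()
    )
    for job in stale:
        print("stale heartbeat:", job.id, job.job_type, job.latest_heartbeat)
{code}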

The scheduler itself also hangs intermittently on the same “connect” call, which 
leaves its SchedulerJob row stuck as running in the DB.
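For context, the py-spy stack in the description below is the scheduler-side 
version of the same hang: the executor heartbeat calls celery_executor.sync(), 
which forks a multiprocessing pool and polls Celery task states, and each poll 
goes through the database result backend and opens a new Postgres connection. 
Roughly (paraphrased from the stack, not the actual 1.10.6 source; the 
signatures here are illustrative):

{code:python}
# Paraphrase of the call path shown in the py-spy stack, not the actual
# airflow/executors/celery_executor.py source; signatures are illustrative.
from multiprocessing import Pool


def fetch_celery_task_state(async_result):
    # Reading .state on a database-backed Celery AsyncResult lazily opens a
    # fresh SQLAlchemy/psycopg2 connection to the Postgres result backend --
    # the connect() frame the stack shows the pool workers stuck in.
    return async_result, async_result.state


def sync(async_results, parallelism=2):
    # Each executor heartbeat forks a new pool of processes to poll the
    # Celery task states in parallel.
    with Pool(processes=parallelism) as pool:
        return pool.map(fetch_celery_task_state, async_results)
{code}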

> Scheduler hanging every 45 minutes
> ----------------------------------
>
>                 Key: AIRFLOW-6134
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: Brent Driskill
>            Priority: Major
>
> We have been running Airflow successfully for the past few months. However, 
> starting on the morning of 11/27, the scheduler hung for an unknown reason. 
> After restarting it, it continued to hang every 30-45 minutes. We have 
> temporarily implemented a health check that restarts it at that interval, but 
> the scheduler remains unreliable.
> The last logs during the hang are the following (this block is just logged 
> over and over; I assume it comes from the other thread):
> {code:java}
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
> 17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
> {code}
> The last logs before that loop were the following:
> {code:java}
> 17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
> 17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
> 17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
> {code}
> Running py-spy against the scheduler process, I see it hung in the following place:
> {code:java}
> Thread 4566 (idle): "MainThread"    
> connect (psycopg2/__init__.py:130)    
> connect (sqlalchemy/engine/default.py:482)    
> connect (sqlalchemy/engine/strategies.py:114)    
> __connect (sqlalchemy/pool/base.py:639)    
> __init__ (sqlalchemy/pool/base.py:437)    
> _create_connection (sqlalchemy/pool/base.py:308)    
> _do_get (sqlalchemy/pool/impl.py:136)    
> checkout (sqlalchemy/pool/base.py:492)    
> _checkout (sqlalchemy/pool/base.py:760)    
> connect (sqlalchemy/pool/base.py:363)    
> _wrap_pool_connect (sqlalchemy/engine/base.py:2276)    
> _contextual_connect (sqlalchemy/engine/base.py:2242)    
> _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)    
> __enter__ (contextlib.py:112)    
> _run_visitor (sqlalchemy/engine/base.py:2048)    
> create_all (sqlalchemy/sql/schema.py:4316)    
> prepare_models (celery/backends/database/session.py:54)    
> session_factory (celery/backends/database/session.py:59)    
> ResultSession (celery/backends/database/__init__.py:99)    
> _get_task_meta_for (celery/backends/database/__init__.py:122)    
> _inner (celery/backends/database/__init__.py:53)    
> get_task_meta (celery/backends/base.py:386)    
> _get_task_meta (celery/result.py:412)    
> state (celery/result.py:473)    
> fetch_celery_task_state (airflow/executors/celery_executor.py:106)    
> mapstar (multiprocessing/pool.py:44)    
> worker (multiprocessing/pool.py:121)    
> run (multiprocessing/process.py:99)    
> _bootstrap (multiprocessing/process.py:297)    
> _launch (multiprocessing/popen_fork.py:74)    
> __init__ (multiprocessing/popen_fork.py:20)    
> _Popen (multiprocessing/context.py:277)    
> start (multiprocessing/process.py:112)    
> _repopulate_pool (multiprocessing/pool.py:241)    
> __init__ (multiprocessing/pool.py:176)    
> Pool (multiprocessing/context.py:119)    
> sync (airflow/executors/celery_executor.py:245)    
> heartbeat (airflow/executors/base_executor.py:136)    
> _execute_helper (airflow/jobs/scheduler_job.py:1445)    
> _execute (airflow/jobs/scheduler_job.py:1356)    
> run (airflow/jobs/base_job.py:222)    
> scheduler (airflow/bin/cli.py:1042)    
> wrapper (airflow/utils/cli.py:74)    
> <module> (airflow:37)
> {code}
> We are utilizing Postgres as our result_backend and using the CeleryExecutor.


