[ https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brent Driskill updated AIRFLOW-6134:
------------------------------------
    Summary: Scheduler hanging every 45 minutes, workers hanging on every job  (was: Scheduler hanging every 45 minutes)

> Scheduler hanging every 45 minutes, workers hanging on every job
> ----------------------------------------------------------------
>
>                 Key: AIRFLOW-6134
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: Brent Driskill
>            Priority: Major
>
> We have been running Airflow successfully for the past few months. However, starting on the morning of 11/27, the scheduler hung for an unknown reason. After restarting it, it continued to hang every 30-45 minutes. We have temporarily implemented a health check to restart it at this interval, but the scheduler remains unreliable.
> The last lines logged during the hang are the following (this is just logged over and over; I assume this is the other thread):
> {code:java}
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
> 17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
> {code}
> The last lines logged before that loop were the following:
> {code:java}
> 17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
> 17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
> 17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
> {code}
> Running py-spy against the hung process, I see it blocked at the following place:
> {code:java}
> Thread 4566 (idle): "MainThread"
>     connect (psycopg2/__init__.py:130)
>     connect (sqlalchemy/engine/default.py:482)
>     connect (sqlalchemy/engine/strategies.py:114)
>     __connect (sqlalchemy/pool/base.py:639)
>     __init__ (sqlalchemy/pool/base.py:437)
>     _create_connection (sqlalchemy/pool/base.py:308)
>     _do_get (sqlalchemy/pool/impl.py:136)
>     checkout (sqlalchemy/pool/base.py:492)
>     _checkout (sqlalchemy/pool/base.py:760)
>     connect (sqlalchemy/pool/base.py:363)
>     _wrap_pool_connect (sqlalchemy/engine/base.py:2276)
>     _contextual_connect (sqlalchemy/engine/base.py:2242)
>     _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)
>     __enter__ (contextlib.py:112)
>     _run_visitor (sqlalchemy/engine/base.py:2048)
>     create_all (sqlalchemy/sql/schema.py:4316)
>     prepare_models (celery/backends/database/session.py:54)
>     session_factory (celery/backends/database/session.py:59)
>     ResultSession (celery/backends/database/__init__.py:99)
>     _get_task_meta_for (celery/backends/database/__init__.py:122)
>     _inner (celery/backends/database/__init__.py:53)
>     get_task_meta (celery/backends/base.py:386)
>     _get_task_meta (celery/result.py:412)
>     state (celery/result.py:473)
>     fetch_celery_task_state (airflow/executors/celery_executor.py:106)
>     mapstar (multiprocessing/pool.py:44)
>     worker (multiprocessing/pool.py:121)
>     run (multiprocessing/process.py:99)
>     _bootstrap (multiprocessing/process.py:297)
>     _launch (multiprocessing/popen_fork.py:74)
>     __init__ (multiprocessing/popen_fork.py:20)
>     _Popen (multiprocessing/context.py:277)
>     start (multiprocessing/process.py:112)
>     _repopulate_pool (multiprocessing/pool.py:241)
>     __init__ (multiprocessing/pool.py:176)
>     Pool (multiprocessing/context.py:119)
>     sync (airflow/executors/celery_executor.py:245)
>     heartbeat (airflow/executors/base_executor.py:136)
>     _execute_helper (airflow/jobs/scheduler_job.py:1445)
>     _execute (airflow/jobs/scheduler_job.py:1356)
>     run (airflow/jobs/base_job.py:222)
>     scheduler (airflow/bin/cli.py:1042)
>     wrapper (airflow/utils/cli.py:74)
>     <module> (airflow:37)
> {code}
> We are utilizing Postgres as our results_backend and using the CeleryExecutor.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
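The stack trace shows the executor heartbeat blocked inside a `multiprocessing.Pool` worker while `fetch_celery_task_state` opens a fresh Postgres connection to the result backend, and `sync()` waits on the pool with no deadline, so a single stalled `connect()` stalls the entire scheduler loop. A minimal stdlib-only sketch of that failure pattern (not Airflow code; `fetch_state`, `sync_states`, and the `"stuck"` marker are hypothetical stand-ins) and of a timeout-bounded alternative:

```python
import multiprocessing
import time


def fetch_state(task_id):
    # Stand-in for fetch_celery_task_state: a backend lookup that can hang,
    # the way the psycopg2 connect() in the stack trace does.
    if task_id == "stuck":
        time.sleep(30)
    return (task_id, "SUCCESS")


def sync_states(task_ids, timeout=2):
    # map_async + get(timeout=...) instead of a blocking map(): one hung
    # worker makes this heartbeat give up rather than hang the scheduler.
    with multiprocessing.Pool(2) as pool:
        result = pool.map_async(fetch_state, task_ids)
        try:
            return dict(result.get(timeout=timeout))
        except multiprocessing.TimeoutError:
            return {}  # skip this round; the pool is terminated on exit


if __name__ == "__main__":
    print(sync_states(["a", "b"]))   # healthy backend: states come back
    print(sync_states(["stuck"]))    # hung backend: empty dict, no deadlock
```

This only illustrates why the hang is total rather than per-task; it is not a fix for the underlying stuck connection.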
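Since the process is parked inside `psycopg2` `connect()` with no deadline, one possible mitigation (my assumption as a workaround sketch, not something verified in this issue) is to bound connection attempts to the Postgres result backend via Celery's `database_engine_options`, which are passed through to SQLAlchemy's `create_engine`:

```python
# Hypothetical fragment for the module referenced by Airflow's
# celery_config_options setting; the values shown are examples.
DATABASE_ENGINE_OPTIONS = {
    # psycopg2 honors connect_timeout (seconds), so a stuck TCP connect
    # to the result backend errors out instead of blocking indefinitely.
    "connect_args": {"connect_timeout": 10},
    # Check pooled connections before reuse (SQLAlchemy 1.2+).
    "pool_pre_ping": True,
}
```

With a timeout in place the heartbeat would fail loudly on a bad backend connection, which is at least visible in the logs, rather than hanging silently every 30-45 minutes.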