[
https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985688#comment-16985688
]
Brent Driskill edited comment on AIRFLOW-6134 at 12/2/19 12:47 AM:
-------------------------------------------------------------------
We are also seeing a similar issue with the Airflow workers. I noticed that
they hang and log the following message repeatedly (every ~5 seconds):
{code}
[2019-12-01 14:47:28,790] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960264 s
{code}
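That message comes from the job heartbeat rate limiter. Paraphrased
(illustrative, not the actual base_job.py source; the method and attribute
names below are placeholders), the loop behind it looks roughly like:
{code:python}
import time

HEARTRATE = 5.0  # job.heartrate, driven by job_heartbeat_sec in airflow.cfg

def heartbeat_loop(job):
    while not job.is_terminated():  # placeholder for the job's run loop
        elapsed = time.time() - job.latest_heartbeat_ts
        if elapsed < HEARTRATE:
            sleep_for = HEARTRATE - elapsed
            # This is the WARNING we see repeating every ~5 seconds:
            print("Time since last heartbeat(%.2f s) < heartrate(%.1f s), "
                  "sleeping for %f s" % (elapsed, HEARTRATE, sleep_for))
            time.sleep(sleep_for)
        # The heartbeat then writes the job row back to the metadata DB;
        # the dump below shows the process blocked inside that DB connect.
        job.merge_into_metadata_db()  # placeholder for session.merge(job)
        job.latest_heartbeat_ts = time.time()
{code}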
Taking a thread dump, I see the jobs hung in the same spot, blocked on a
database connect during the job heartbeat:
{code:java}
Thread 0x7F200B91FD48 (active): "MainThread"
connect (__init__.py:130)
connect (default.py:482)
connect (strategies.py:114)
__connect (base.py:639)
__init__ (base.py:437)
_create_connection (base.py:308)
_do_get (impl.py:238)
checkout (base.py:492)
_checkout (base.py:760)
connect (base.py:363)
_wrap_pool_connect (base.py:2276)
_contextual_connect (base.py:2242)
_connection_for_bind (session.py:431)
_connection_for_bind (session.py:1130)
connection (session.py:1124)
_connection_from_session (query.py:3382)
_get_bind_args (query.py:3404)
_execute_and_instances (query.py:3389)
__iter__ (query.py:3367)
one_or_none (query.py:3294)
one (query.py:3325)
load_on_pk_identity (loading.py:284)
_get_impl (query.py:1094)
get (query.py:982)
_merge (session.py:2177)
merge (session.py:2104)
heartbeat (base_job.py:194)
_execute (local_task_job.py:106)
run (base_job.py:222)
_run (cli.py:464)
run (cli.py:551)
wrapper (cli.py:74)
<module> (airflow:37)
{code}
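For anyone who wants to capture the same kind of dump, it can be taken from a
live process with a recent py-spy (PID illustrative):
{code}
py-spy dump --pid 12345   # PID of the hung airflow worker process
{code}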
> Scheduler hanging every 45 minutes
> ----------------------------------
>
> Key: AIRFLOW-6134
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 1.10.6
> Reporter: Brent Driskill
> Priority: Major
>
> We have been running Airflow successfully for the past few months. However,
> starting on the morning of 11/27, the scheduler hung for an unknown reason.
> After restarting it, it continued to hang every 30-45 minutes. We have
> temporarily implemented a health check that restarts it at that interval, but
> the scheduler remains unreliable.
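> As a rough idea of what that health check does: it watches the scheduler's
> heartbeat in the metadata DB and triggers a restart when it goes stale. A
> minimal sketch (DSN and threshold illustrative):
> {code:python}
> import psycopg2  # the metadata DB is Postgres in our setup
>
> # Illustrative DSN; point this at the Airflow metadata database.
> conn = psycopg2.connect("dbname=airflow user=airflow host=metadata-db")
> with conn, conn.cursor() as cur:
>     # Stale if the running SchedulerJob hasn't heartbeated in 2 minutes.
>     cur.execute(
>         "SELECT COALESCE(MAX(latest_heartbeat), 'epoch') "
>         "       < NOW() - INTERVAL '2 minutes' "
>         "FROM job WHERE job_type = 'SchedulerJob' AND state = 'running'"
>     )
>     stale = cur.fetchone()[0]
>
> if stale:
>     print("scheduler heartbeat is stale; restarting")  # hand off to systemd/supervisor
> {code}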
> The last logs during the hang are the following (this block is just logged
> over and over; I assume it comes from the DAG file processing loop rather
> than the stuck main thread):
> {code:java}
> [2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
> [2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
> [2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
> {code}
> The last logs before that loop were the following:
> {code:java}
> [2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
> [2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
> [2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
> {code}
> Running py-spy against the scheduler process, I see it hung in the following
> place:
> {code:java}
> Thread 4566 (idle): "MainThread"
> connect (psycopg2/__init__.py:130)
> connect (sqlalchemy/engine/default.py:482)
> connect (sqlalchemy/engine/strategies.py:114)
> __connect (sqlalchemy/pool/base.py:639)
> __init__ (sqlalchemy/pool/base.py:437)
> _create_connection (sqlalchemy/pool/base.py:308)
> _do_get (sqlalchemy/pool/impl.py:136)
> checkout (sqlalchemy/pool/base.py:492)
> _checkout (sqlalchemy/pool/base.py:760)
> connect (sqlalchemy/pool/base.py:363)
> _wrap_pool_connect (sqlalchemy/engine/base.py:2276)
> _contextual_connect (sqlalchemy/engine/base.py:2242)
> _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)
> __enter__ (contextlib.py:112)
> _run_visitor (sqlalchemy/engine/base.py:2048)
> create_all (sqlalchemy/sql/schema.py:4316)
> prepare_models (celery/backends/database/session.py:54)
> session_factory (celery/backends/database/session.py:59)
> ResultSession (celery/backends/database/__init__.py:99)
> _get_task_meta_for (celery/backends/database/__init__.py:122)
> _inner (celery/backends/database/__init__.py:53)
> get_task_meta (celery/backends/base.py:386)
> _get_task_meta (celery/result.py:412)
> state (celery/result.py:473)
> fetch_celery_task_state (airflow/executors/celery_executor.py:106)
> mapstar (multiprocessing/pool.py:44)
> worker (multiprocessing/pool.py:121)
> run (multiprocessing/process.py:99)
> _bootstrap (multiprocessing/process.py:297)
> _launch (multiprocessing/popen_fork.py:74)
> __init__ (multiprocessing/popen_fork.py:20)
> _Popen (multiprocessing/context.py:277)
> start (multiprocessing/process.py:112)
> _repopulate_pool (multiprocessing/pool.py:241)
> __init__ (multiprocessing/pool.py:176)
> Pool (multiprocessing/context.py:119)
> sync (airflow/executors/celery_executor.py:245)
> heartbeat (airflow/executors/base_executor.py:136)
> _execute_helper (airflow/jobs/scheduler_job.py:1445)
> _execute (airflow/jobs/scheduler_job.py:1356)
> run (airflow/jobs/base_job.py:222)
> scheduler (airflow/bin/cli.py:1042)
> wrapper (airflow/utils/cli.py:74)
> <module> (airflow:37)
> {code}
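> For what it's worth, this stack has the shape of the classic
> fork-while-locked hazard: per the frames above, celery_executor.sync()
> builds a fresh multiprocessing.Pool on each executor heartbeat, and a forked
> child inherits whatever internal locks psycopg2/SQLAlchemy happened to hold
> at fork time, with no thread left alive in the child to release them. A
> minimal illustration of the mechanism (not Airflow code; assumes the default
> fork start method on Linux):
> {code:python}
> import multiprocessing
> import threading
> import time
>
> lock = threading.Lock()  # stands in for an internal psycopg2/SQLAlchemy lock
>
> def hold_lock():
>     with lock:
>         time.sleep(30)  # the lock is still held when the Pool forks below
>
> def child_work(_):
>     # The forked child inherits the lock in its *locked* state, but the
>     # thread that would release it does not exist in the child -> deadlock.
>     with lock:
>         return "done"
>
> if __name__ == "__main__":
>     threading.Thread(target=hold_lock, daemon=True).start()
>     time.sleep(0.1)  # make sure the lock is held at fork time
>     with multiprocessing.Pool(1) as pool:
>         print(pool.map(child_work, [0]))  # hangs: child blocks on inherited lock
> {code}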
> We are using Postgres as our result_backend together with the CeleryExecutor.
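> For reference, the relevant piece of our airflow.cfg looks roughly like this
> (host and credentials illustrative; the db+ prefix selects Celery's
> SQLAlchemy result backend):
> {code}
> [celery]
> result_backend = db+postgresql://airflow:***@metadata-db:5432/airflow
> {code}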