[ 
https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985688#comment-16985688
 ] 

Brent Driskill edited comment on AIRFLOW-6134 at 12/2/19 12:47 AM:
-------------------------------------------------------------------

We are also seeing a similar issue with the airflow workers. I noticed that 
they hang and log the following message repeatedly (every 5 seconds):

[2019-12-01 14:47:28,790] \{local_task_job.py:124} WARNING - Time since last 
heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960264 s

Doing a dump, I see the jobs hung in the same spot:
{code:java}
Thread 0x7F200B91FD48 (active): "MainThread"
    connect (__init__.py:130)
    connect (default.py:482)
    connect (strategies.py:114)
    __connect (base.py:639)
    __init__ (base.py:437)
    _create_connection (base.py:308)
    _do_get (impl.py:238)
    checkout (base.py:492)
    _checkout (base.py:760)
    connect (base.py:363)
    _wrap_pool_connect (base.py:2276)
    _contextual_connect (base.py:2242)
    _connection_for_bind (session.py:431)
    _connection_for_bind (session.py:1130)
    connection (session.py:1124)
    _connection_from_session (query.py:3382)
    _get_bind_args (query.py:3404)
    _execute_and_instances (query.py:3389)
    __iter__ (query.py:3367)
    one_or_none (query.py:3294)
    one (query.py:3325)
    load_on_pk_identity (loading.py:284)
    _get_impl (query.py:1094)
    get (query.py:982)
    _merge (session.py:2177)
    merge (session.py:2104)
    heartbeat (base_job.py:194)
    _execute (local_task_job.py:106)
    run (base_job.py:222)
    _run (cli.py:464)
    run (cli.py:551)
    wrapper (cli.py:74)
    <module> (airflow:37)
{code}


was (Author: brent.driskill.resonate):
We are also seeing a similar issue with the airflow workers. I noticed that 
they hang and log the following message repeatedly (every 5 seconds):

[2019-12-01 14:47:28,790] \{local_task_job.py:124} WARNING - Time since last 
heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960264 s

> Scheduler hanging every 45 minutes
> ----------------------------------
>
>                 Key: AIRFLOW-6134
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: Brent Driskill
>            Priority: Major
>
> We have been running Airflow successfully for the past few months. However, 
> starting on the morning of 11/27, the scheduler hung for an unknown reason. 
> After restarting it, it continued to hang every 30-45 minutes. We have 
> temporarily implemented a health check to restart it at this interval but the 
> scheduler continues to not be reliable.
> The last logs during the hang are the following (this is just logged over and 
> over, I assume this is the other thread):
> {code:java}
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG 
> parsing processes running
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file 
> paths queued for processing
> 17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing 
> the following files for processing:
> {code}
> The last logs before that loop were the following:
> {code:java}
> 17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB 
> connection pool (PID 2232)
> 17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for 
> <Process(DagFileProcessor493-Process, stopped)>
> 17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor 
> for <ommitted> finished
> {code}
> Doing a py-spy on the running process, I see it hung at the following place:
> {code:java}
> Thread 4566 (idle): "MainThread"    
> connect (psycopg2/__init__.py:130)    
> connect (sqlalchemy/engine/default.py:482)    
> connect (sqlalchemy/engine/strategies.py:114)    
> __connect (sqlalchemy/pool/base.py:639)    
> __init__ (sqlalchemy/pool/base.py:437)    
> _create_connection (sqlalchemy/pool/base.py:308)    
> _do_get (sqlalchemy/pool/impl.py:136)    
> checkout (sqlalchemy/pool/base.py:492)    
> _checkout (sqlalchemy/pool/base.py:760)    
> connect (sqlalchemy/pool/base.py:363)    
> _wrap_pool_connect (sqlalchemy/engine/base.py:2276)    
> _contextual_connect (sqlalchemy/engine/base.py:2242)    
> _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)    
> __enter__ (contextlib.py:112)    
> _run_visitor (sqlalchemy/engine/base.py:2048)    
> create_all (sqlalchemy/sql/schema.py:4316)    
> prepare_models (celery/backends/database/session.py:54)    
> session_factory (celery/backends/database/session.py:59)    
> ResultSession (celery/backends/database/__init__.py:99)    
> _get_task_meta_for (celery/backends/database/__init__.py:122)    
> _inner (celery/backends/database/__init__.py:53)    
> get_task_meta (celery/backends/base.py:386)    
> _get_task_meta (celery/result.py:412)    
> state (celery/result.py:473)    
> fetch_celery_task_state (airflow/executors/celery_executor.py:106)    
> mapstar (multiprocessing/pool.py:44)    
> worker (multiprocessing/pool.py:121)    
> run (multiprocessing/process.py:99)    
> _bootstrap (multiprocessing/process.py:297)    
> _launch (multiprocessing/popen_fork.py:74)    
> __init__ (multiprocessing/popen_fork.py:20)    
> _Popen (multiprocessing/context.py:277)    
> start (multiprocessing/process.py:112)    
> _repopulate_pool (multiprocessing/pool.py:241)    
> __init__ (multiprocessing/pool.py:176)    
> Pool (multiprocessing/context.py:119)    
> sync (airflow/executors/celery_executor.py:245)    
> heartbeat (airflow/executors/base_executor.py:136)    
> _execute_helper (airflow/jobs/scheduler_job.py:1445)    
> _execute (airflow/jobs/scheduler_job.py:1356)    
> run (airflow/jobs/base_job.py:222)    
> scheduler (airflow/bin/cli.py:1042)    
> wrapper (airflow/utils/cli.py:74)    <module> (airflow:37)
> {code}
> We are utilizing Postgres as our results_backend and using the CeleryExecutor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to