[ 
https://issues.apache.org/jira/browse/AIRFLOW-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kozikowski updated AIRFLOW-515:
--------------------------------------
    Description: 
h1. Symptoms
Triggering a DAG on the web node via trigger_dag does not start it. Triggering 
the same DAG worked a moment before, and I did not apply any config changes in 
the meantime.

Triggering the DAG (airflow trigger_dag on the web node) only adds a row to the 
Postgres table dag_run. The run does not appear in Flower or the main web GUI. 
The system was stuck for 50 minutes without any signs of life, with no logs 
coming from the scheduler, the worker, or the webserver.
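
For reference, the stuck run can be confirmed directly in the metadata DB; a 
minimal sketch along these lines (connection parameters are placeholders for my 
setup, the dag_id is the one from the logs below):
{noformat}
# Sketch: inspect dag_run and task_instance for the triggered DAG.
# Connection parameters below are placeholders for my setup.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="airflow",
                        user="airflow", password="airflow")
cur = conn.cursor()

# The run added by trigger_dag shows up here with its state ...
cur.execute("""
    SELECT run_id, execution_date, state
    FROM dag_run
    WHERE dag_id = %s
    ORDER BY execution_date DESC
    LIMIT 5
""", ("repos_to_process_dag",))
print(cur.fetchall())

# ... while the task instances for the same DAG show whether anything
# actually got scheduled and what state it is in.
cur.execute("""
    SELECT task_id, execution_date, state
    FROM task_instance
    WHERE dag_id = %s
    ORDER BY execution_date DESC
    LIMIT 5
""", ("repos_to_process_dag",))
print(cur.fetchall())
{noformat}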

In the end, the system got unblocked after force-restarting the scheduler and 
the worker. Is there any way I should health-check my scheduler to avoid a 
similar situation in production?
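
One approach I am considering is to watch the scheduler's heartbeat in the 
metadata DB: the scheduler records it in the job table's latest_heartbeat 
column (the same data the "Failing jobs without heartbeat" log line below is 
based on), so a probe can alert when no running SchedulerJob has heartbeated 
recently. A minimal sketch, assuming the 1.7.x schema and placeholder 
connection parameters:
{noformat}
# Sketch of a scheduler health check: exit non-zero if no running
# SchedulerJob has heartbeated within the last few minutes.
# Assumes the 1.7.x metadata schema (job table with job_type, state,
# latest_heartbeat); connection parameters are placeholders for my setup.
import sys
from datetime import datetime, timedelta

import psycopg2

MAX_HEARTBEAT_AGE = timedelta(minutes=3)

conn = psycopg2.connect(host="postgres", dbname="airflow",
                        user="airflow", password="airflow")
cur = conn.cursor()
cur.execute("""
    SELECT MAX(latest_heartbeat)
    FROM job
    WHERE job_type = 'SchedulerJob' AND state = 'running'
""")
latest = cur.fetchone()[0]

# Heartbeats are stored as naive timestamps; my containers run in UTC.
if latest is None or datetime.utcnow() - latest > MAX_HEARTBEAT_AGE:
    print("scheduler unhealthy, last heartbeat: %s" % latest)
    sys.exit(1)
print("scheduler healthy, last heartbeat: %s" % latest)
{noformat}
Run from cron or wired up as a Kubernetes liveness probe on the scheduler pod, 
something like this would at least have flagged the silent hour above, but I 
would appreciate guidance on whether there is a recommended way to do this.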

h1. Recent logs
Logs were collected at 19:10, after the system had been stuck for almost an hour.
h2. scheduler
{noformat}
/usr/lib/python3.5/site-packages/sqlalchemy/sql/default_comparator.py:153: 
SAWarning: The IN-predicate on "task_instance.execution_date" was invoked with 
an empty sequence. This results in a contradiction, which nonetheless can be 
expensive to evaluate.  Consider alternative strategies for improved 
performance.
  'strategies for improved performance.' % expr)
[2016-09-19 18:15:41,171] {jobs.py:741} INFO - Done queuing tasks, calling the 
executor's heartbeat
[2016-09-19 18:15:41,172] {jobs.py:744} INFO - Loop took: 0.37502 seconds
[2016-09-19 18:15:41,189] {models.py:305} INFO - Finding 'running' jobs without 
a recent heartbeat
[2016-09-19 18:15:41,189] {models.py:311} INFO - Failing jobs without heartbeat 
after 2016-09-19 18:13:26.189579
[2016-09-19 18:15:41,212] {celery_executor.py:64} INFO - [celery] queuing 
('repos_to_process_dag', 'update_repos_to_process', datetime.datetime(2016, 9, 
19, 18, 15, 35, 200082)) through celery, queue=default
[2016-09-19 18:15:45,817] {jobs.py:574} INFO - Prioritizing 0 queued jobs
[2016-09-19 18:15:45,831] {models.py:154} INFO - Filling up the DagBag from 
/usr/local/airflow/dags
[2016-09-19 18:15:45,881] {jobs.py:726} INFO - Starting 2 scheduler jobs
{noformat}
h2. web
{noformat}
[2016-09-19 18:10:12,213] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,597] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,663] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,790] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:13,210] {models.py:154} INFO - Filling up the DagBag from 
/usr/local/airflow/dags
[2016-09-19 18:10:13,742] {models.py:154} INFO - Filling up the DagBag from 
/usr/local/airflow/dags
[2016-09-19 18:10:13,815] {models.py:154} INFO - Filling up the DagBag from 
/usr/local/airflow/dags
[2016-09-19 18:10:13,912] {models.py:154} INFO - Filling up the DagBag from 
/usr/local/airflow/dags
{noformat}
h2. worker
{noformat}
Logging into: 
/usr/local/airflow/logs/repos_to_process_dag/update_repos_to_process/2016-09-19T18:15:35.200082
{noformat}


> Airflow system is stuck without signs of life
> ---------------------------------------------
>
>                 Key: AIRFLOW-515
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-515
>             Project: Apache Airflow
>          Issue Type: Bug
>         Environment: airflow==1.7.1.3 Dockerfile: 
> https://gist.github.com/kozikow/9290c44f03fba8d0f1299f355ef96b51
> kubernetes config (running within minikube): 
> https://gist.github.com/kozikow/49c812f11d1e4fd43e465e8090619274
>            Reporter: Robert Kozikowski
>



