[ https://issues.apache.org/jira/browse/AIRFLOW-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kozikowski updated AIRFLOW-515:
--------------------------------------
    Summary: Airflow is stuck without signs of life  (was: Airflow system is 
stuck without signs of life)

> Airflow is stuck without signs of life
> --------------------------------------
>
>                 Key: AIRFLOW-515
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-515
>             Project: Apache Airflow
>          Issue Type: Bug
>         Environment: airflow==1.7.1.3 Dockerfile: 
> https://gist.github.com/kozikow/9290c44f03fba8d0f1299f355ef96b51
> kubernetes config (running within minikube): 
> https://gist.github.com/kozikow/49c812f11d1e4fd43e465e8090619274
>            Reporter: Robert Kozikowski
>
> h1. Symptoms
> Triggering a DAG on the web node via trigger_dag does not start it. I 
> triggered the same DAG moments before and it worked, and I did not apply any 
> changes in the meantime.
> Very likely a deadlock on the scheduler node locked up the whole system. 
> Triggering the DAG (airflow trigger_dag on the web node) only adds a row to the 
> postgres table dag_run. The run does not appear in Flower or the main web 
> GUI. The system was stuck for 50 minutes without any signs of life, and no logs 
> came from the scheduler, worker, or webserver.
> In the end, the system became unblocked after force-restarting the scheduler and worker. 
> Is there any way to health-check my scheduler to avoid a similar 
> situation in production?
> h1. Recent logs
> Logs were collected at 19:10; by then the system had been stuck for almost an hour.
> h2. scheduler
> {noformat}
> /usr/lib/python3.5/site-packages/sqlalchemy/sql/default_comparator.py:153: 
> SAWarning: The IN-predicate on "task_instance.execution_date" was invoked 
> with an empty sequence. This results in a contradiction, which nonetheless 
> can be expensive to evaluate.  Consider alternative strategies for improved 
> performance.
>   'strategies for improved performance.' % expr)
> [2016-09-19 18:15:41,171] {jobs.py:741} INFO - Done queuing tasks, calling 
> the executor's heartbeat
> [2016-09-19 18:15:41,172] {jobs.py:744} INFO - Loop took: 0.37502 seconds
> [2016-09-19 18:15:41,189] {models.py:305} INFO - Finding 'running' jobs 
> without a recent heartbeat
> [2016-09-19 18:15:41,189] {models.py:311} INFO - Failing jobs without 
> heartbeat after 2016-09-19 18:13:26.189579
> [2016-09-19 18:15:41,212] {celery_executor.py:64} INFO - [celery] queuing 
> ('repos_to_process_dag', 'update_repos_to_process', datetime.datetime(2016, 
> 9, 19, 18, 15, 35, 200082)) through celery, queue=default
> [2016-09-19 18:15:45,817] {jobs.py:574} INFO - Prioritizing 0 queued jobs
> [2016-09-19 18:15:45,831] {models.py:154} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags
> [2016-09-19 18:15:45,881] {jobs.py:726} INFO - Starting 2 scheduler jobs
> {noformat}
> h2. web
> {noformat}
> [2016-09-19 18:10:12,213] {__init__.py:36} INFO - Using executor 
> CeleryExecutor
> [2016-09-19 18:10:12,597] {__init__.py:36} INFO - Using executor 
> CeleryExecutor
> [2016-09-19 18:10:12,663] {__init__.py:36} INFO - Using executor 
> CeleryExecutor
> [2016-09-19 18:10:12,790] {__init__.py:36} INFO - Using executor 
> CeleryExecutor
> [2016-09-19 18:10:13,210] {models.py:154} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags
> [2016-09-19 18:10:13,742] {models.py:154} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags
> [2016-09-19 18:10:13,815] {models.py:154} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags
> [2016-09-19 18:10:13,912] {models.py:154} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags
> {noformat}
> h2. worker
> {noformat}
> Logging into: 
> /usr/local/airflow/logs/repos_to_process_dag/update_repos_to_process/2016-09-19T18:15:35.200082
> {noformat}
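Regarding the health-check question above: one minimal sketch is to poll the Airflow metadata DB for a recent SchedulerJob heartbeat. This assumes direct SQL access to the metadata DB and a job table with job_type, state, and latest_heartbeat columns; the table and column names here are illustrative and may differ between Airflow versions, so treat this as a starting point rather than a supported API.

```python
import datetime
import sqlite3  # stand-in DB driver for the sketch; production would use the real metadata DB


def scheduler_is_alive(conn, threshold_seconds=180):
    """Return True if a 'running' SchedulerJob heartbeated within the threshold.

    Assumes a `job` table with `job_type`, `state`, and `latest_heartbeat`
    (ISO-8601 text) columns -- names are illustrative, not guaranteed.
    """
    row = conn.execute(
        "SELECT MAX(latest_heartbeat) FROM job "
        "WHERE job_type = 'SchedulerJob' AND state = 'running'"
    ).fetchone()
    if row is None or row[0] is None:
        # No running scheduler job recorded at all.
        return False
    last_heartbeat = datetime.datetime.fromisoformat(row[0])
    age = (datetime.datetime.utcnow() - last_heartbeat).total_seconds()
    return age < threshold_seconds
```

A cron job or Kubernetes liveness probe could call this and restart the scheduler when it returns False, which would have caught the hour-long stall described above.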



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)