[ 
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732334#comment-15732334
 ] 

Leonid Evdokimov commented on AIRFLOW-401:
------------------------------------------

I've seen a similar problem with CeleryExecutor on 1.7.1.3 as well. The 
observable symptoms were as follows:

* There is a process tree with one parent process, {{airflow scheduler -n 5}}, 
and two child subprocesses
* The parent process uses 100% CPU, running in a {{waitpid(WNOHANG)}} busy-loop
* The two children are stuck on {{recvfrom(<rabbitmq-socket>)}}
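
To illustrate why the parent burns a full core in that state, here is a minimal sketch of a non-blocking {{waitpid}} polling loop. It is not Airflow's actual code; the function name {{poll_child}} and the {{pause}} parameter are hypothetical and exist only for this illustration. With {{pause=0}} the loop spins as fast as the CPU allows while the child stays alive, which matches the busy-loop symptom described in the list above.

```python
import os
import time


def poll_child(child_pid, duration, pause=0.0):
    """Poll a child with waitpid(WNOHANG) for `duration` seconds.

    Returns the number of waitpid() calls made. Hypothetical helper,
    for illustration only.
    """
    calls = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        # With WNOHANG, waitpid() returns (0, 0) immediately while the
        # child is still running instead of blocking until it exits.
        pid, _status = os.waitpid(child_pid, os.WNOHANG)
        calls += 1
        if pid != 0:
            break  # the child exited and has been reaped
        if pause:
            time.sleep(pause)  # yield the CPU between polls
    return calls
```

If the children never exit (for example because they are blocked in {{recvfrom}}), a loop like this with no pause never gets a reason to stop.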

I used the {{puckel/docker-airflow:1.7.1.3-5}} Docker image to reproduce the 
bug. The bug _MAY_ be specific to some version of the celery / rabbitmq library.

I'm not 100% sure, but it seems to me that there are two different bugs with 
similar symptoms in 1.7.1.3.

There is a possible workaround for the bug while running CeleryExecutor: run 
the scheduler (and *only* the scheduler) with a strict CPU time limit, so that 
it is terminated when it enters the busy-loop. Running 
{{prlimit --cpu=35:40 -- airflow scheduler -n 10}} solved the issue for me.
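
For setups where {{prlimit}} is not available, roughly the same effect can probably be achieved from Python by setting {{RLIMIT_CPU}} in the child before it execs. This is only a sketch under that assumption: the helper name {{run_with_cpu_limit}} is made up for this example, and I have only reasoned about it on Linux.

```python
import resource
import subprocess


def run_with_cpu_limit(cmd, soft_seconds, hard_seconds):
    """Run `cmd` with a CPU time limit applied to the child only.

    Returns the child's exit status; a negative value means the child
    was killed by that signal number. Hypothetical helper.
    """

    def limit_cpu():
        # Runs in the child between fork() and exec(), so the limit does
        # not apply to the calling process. The kernel sends SIGXCPU at
        # the soft limit and SIGKILL at the hard limit.
        resource.setrlimit(resource.RLIMIT_CPU, (soft_seconds, hard_seconds))

    return subprocess.call(cmd, preexec_fn=limit_cpu)


# Same limits as the prlimit command above (35 s soft, 40 s hard):
# run_with_cpu_limit(["airflow", "scheduler", "-n", "10"], 35, 40)
```

The limit counts CPU seconds, not wall-clock time, so a scheduler that is mostly idle is not affected; only the spinning one gets terminated. Something (a supervisor or a shell loop) still has to restart the scheduler afterwards.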

I don't know what sort of inconsistencies may be triggered by killing the 
scheduler in the middle of execution, so use the recipe at your own risk :)

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, 
> scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU 
> usage of the scheduler service is at 100%. No jobs get submitted and 
> everything comes to a halt. It looks like it goes into some kind of infinite 
> loop. 
> The only way I could make it run again is by manually restarting the 
> scheduler service. But again, after running some tasks it gets stuck. I've 
> tried with both the Celery and Local executors, but the same issue occurs. I 
> am using the -n 3 parameter when starting the scheduler. 
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
