[ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732334#comment-15732334 ]
Leonid Evdokimov commented on AIRFLOW-401:
------------------------------------------

I've seen a similar problem with CeleryExecutor on 1.7.1.3 as well. The observable symptoms were:
* A process tree with one parent {{airflow scheduler -n 5}} process and two child subprocesses
* The parent process using 100% CPU in a {{waitpid(WNOHANG)}} busy-loop
* The two children stuck on {{recvfrom(<rabbitmq-socket>)}}

I used the {{puckel/docker-airflow:1.7.1.3-5}} docker image to reproduce the bug; the bug _MAY_ be specific to some version of the celery / rabbitmq library. I'm not 100% sure, but it seems to me that there are two different bugs with similar symptoms in 1.7.1.3.

There is a possible workaround for the bug while running CeleryExecutor: run the scheduler (and *only* the scheduler) with a strict CPU time limit, so it is terminated when it enters the busy-loop. Running {{prlimit --cpu=35:40 -- airflow scheduler -n 10}} solved the issue for me. I don't know what sort of inconsistencies may be triggered by killing the scheduler in the middle of execution, so use the recipe at your own risk :)

> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor, scheduler
> Affects Versions: Airflow 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck.
> I've tried with both Celery and Local executors but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
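
The {{waitpid(WNOHANG)}} busy-loop symptom described in the comment above can be illustrated with a minimal, self-contained sketch. This is *not* Airflow's actual scheduler code; the function names are hypothetical, and it only shows why non-blocking child polling without a back-off spins at 100% CPU while a blocking wait does not:

```python
# Hypothetical sketch of the suspected failure mode: polling children with
# os.WNOHANG in a tight loop burns CPU once the children block forever
# (in the reported bug, on recvfrom() against RabbitMQ).
import os
import time


def reap_children_polling(pids):
    """Poll-style reaping: os.waitpid(pid, os.WNOHANG) returns (0, 0)
    immediately while the child is still running, so without the sleep
    below the loop spins at 100% CPU until every child exits."""
    remaining = set(pids)
    while remaining:
        for pid in list(remaining):
            reaped, _status = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                remaining.discard(pid)
        time.sleep(0.1)  # the back-off whose absence causes the busy-loop


def reap_children_blocking(pids):
    """Blocking alternative: os.waitpid(-1, 0) sleeps in the kernel until
    some child exits, so the parent uses no CPU while waiting."""
    for _ in pids:
        os.waitpid(-1, 0)
```

If a child never exits (as with the children stuck on {{recvfrom}}), the polling variant never leaves its loop, which matches the observed behaviour; the {{prlimit --cpu}} workaround simply caps how long that spinning process may accumulate CPU time before the kernel kills it.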