[ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932286#comment-16932286 ]
Neil Calabroso commented on AIRFLOW-401:
----------------------------------------

We are currently experiencing this issue on `Ubuntu 14.04` with `python 3.6.8`. It started when we upgraded our staging environment from `1.10.1` to `1.10.4`. We're using `LocalExecutor`, and the process is managed by upstart.

The Web UI also shows this warning:

The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago.

For this sample, I got 3 stuck processes:

{code:java}
root@airflow-staging/home/ubuntu# ps aux | grep scheduler
airflow  21595  0.2  1.3  469868 109976 ?      S    09:52  0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21602  0.0  1.1 1500268  95992 ?      Tl   09:52  0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21648  0.0  1.1  467796  94628 ?      S    09:52  0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root     25735  0.0  0.0   10472    920 pts/3  S+   10:24  0:00 grep --color=auto scheduler
{code}

Running py-spy on each process gives:

{code:java}
Collecting samples from 'pid: 21595' (python v3.6.8)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   5.00s     5.00s   _recv (multiprocessing/connection.py:379)
  0.00% 100.00%   0.000s    5.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    5.00s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    5.00s   end (airflow/executors/local_executor.py:233)
  0.00% 100.00%   0.000s    5.00s   <module> (airflow:32)
  0.00% 100.00%   0.000s    5.00s   recv (multiprocessing/connection.py:250)
  0.00% 100.00%   0.000s    5.00s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    5.00s   end (airflow/executors/local_executor.py:212)
  0.00% 100.00%   0.000s    5.00s   _callmethod (multiprocessing/managers.py:757)
  0.00% 100.00%   0.000s    5.00s   join (<string>:2)
  0.00% 100.00%   0.000s    5.00s   _recv_bytes (multiprocessing/connection.py:407)
  0.00% 100.00%   0.000s    5.00s   _execute_helper (airflow/jobs/scheduler_job.py:1463)
  0.00% 100.00%   0.000s    5.00s   run (airflow/jobs/base_job.py:213)
{code}

{code:java}
root@airflow-staging:/home/ubuntu# py-spy --pid 21602
Error: Failed to suspend process
Reason: EPERM: Operation not permitted
{code}

{code:java}
Collecting samples from 'pid: 21648' (python v3.6.8)
Total Samples 28381
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   283.8s    283.8s   _try_wait (subprocess.py:1424)
  0.00% 100.00%   0.000s    283.8s   call (subprocess.py:289)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:184)
  0.00% 100.00%   0.000s    283.8s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    283.8s   _bootstrap (multiprocessing/process.py:258)
  0.00% 100.00%   0.000s    283.8s   _execute_helper (airflow/jobs/scheduler_job.py:1347)
  0.00% 100.00%   0.000s    283.8s   execute_work (airflow/executors/local_executor.py:86)
  0.00% 100.00%   0.000s    283.8s   <module> (airflow:32)
  0.00% 100.00%   0.000s    283.8s   _launch (multiprocessing/popen_fork.py:73)
  0.00% 100.00%   0.000s    283.8s   run (airflow/jobs/base_job.py:213)
  0.00% 100.00%   0.000s    283.8s   check_call (subprocess.py:306)
  0.00% 100.00%   0.000s    283.8s   start (multiprocessing/process.py:105)
  0.00% 100.00%   0.000s    283.8s   run (airflow/executors/local_executor.py:116)
  0.00% 100.00%   0.000s    283.8s   wait (subprocess.py:1477)
  0.00% 100.00%   0.000s    283.8s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:277)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:223)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:224)
  0.00% 100.00%   0.000s    283.8s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    283.8s   __init__ (multiprocessing/popen_fork.py:19)
{code}

We will try downgrading to `1.10.3` first to see whether the problem persists.

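In the `ps aux` listing in the comment above, pid 21602 is in state `Tl`. According to ps(1), `T` means the process is stopped by a job control signal and `l` means it is multi-threaded. Whether that is related to the py-spy `EPERM` error is not established here. As a rough sketch only, stopped scheduler processes could be picked out of `ps aux` output as follows. The names `PsRow`, `parse_ps_aux`, and `stopped_processes` are hypothetical helpers for illustration, not part of Airflow or py-spy.

```python
# Hypothetical helpers (not part of Airflow): flag stopped processes in `ps aux` output.
from typing import List, NamedTuple


class PsRow(NamedTuple):
    user: str
    pid: int
    stat: str
    command: str


def parse_ps_aux(output: str) -> List[PsRow]:
    """Parse `ps aux` rows into PsRow tuples, skipping the header and malformed lines."""
    rows = []
    for line in output.strip().splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
        # The first 10 are single tokens; COMMAND may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        rows.append(PsRow(fields[0], int(fields[1]), fields[7], fields[10]))
    return rows


def stopped_processes(rows: List[PsRow]) -> List[PsRow]:
    """Return rows whose STAT starts with 'T' (stopped) or 't' (stopped by a tracer)."""
    return [r for r in rows if r.stat[:1] in ("T", "t")]


# Sample rows copied from the comment above.
SAMPLE = """\
airflow 21595 0.2 1.3 469868 109976 ? S 09:52 0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21602 0.0 1.1 1500268 95992 ? Tl 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21648 0.0 1.1 467796 94628 ? S 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root 25735 0.0 0.0 10472 920 pts/3 S+ 10:24 0:00 grep --color=auto scheduler
"""

if __name__ == "__main__":
    for row in stopped_processes(parse_ps_aux(SAMPLE)):
        print(row.pid, row.stat, row.command)
```

On the sample rows this reports only pid 21602. The check says nothing about why a process is stopped; it only narrows down which pids to inspect further.
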
> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executors, scheduler
> Affects Versions: 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Labels: celery, kombu
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU
> usage of the scheduler service is at 100%. No jobs get submitted and everything
> comes to a halt. It looks like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the
> scheduler service. But after running some tasks, it gets stuck again. I've
> tried both the Celery and Local executors, but the same issue occurs. I am using
> the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)