[
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932286#comment-16932286
]
Neil Calabroso commented on AIRFLOW-401:
----------------------------------------
Currently experiencing this issue on `Ubuntu 14.04` with `python 3.6.8`. It
started when we upgraded our staging environment from `1.10.1` to `1.10.4`.
We're using `LocalExecutor`, and the process is managed by upstart.
The Web UI also reports the problem: "The scheduler does not appear to be
running. Last heartbeat was received 9 minutes ago."
For this sample, I got 3 stuck processes:
{code:java}
root@airflow-staging:/home/ubuntu# ps aux | grep scheduler
airflow 21595 0.2 1.3 469868 109976 ? S 09:52 0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21602 0.0 1.1 1500268 95992 ? Tl 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow 21648 0.0 1.1 467796 94628 ? S 09:52 0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root 25735 0.0 0.0 10472 920 pts/3 S+ 10:24 0:00 grep --color=auto scheduler
{code}
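In the listing above, PID 21602 has `Tl` in the STAT column, which `ps` uses for a stopped (`T`), multi-threaded (`l`) process. A minimal sketch of picking out such processes, assuming the standard `ps aux` column order and using the lines from this comment as sample input (`stopped_pids` is a hypothetical helper name, not part of Airflow or py-spy):

```python
# Sample input: the `ps aux` lines from this comment, with wrapped lines rejoined.
PS_LINES = [
    "airflow 21595 0.2 1.3 469868 109976 ? S 09:52 0:04 "
    "/usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5",
    "airflow 21602 0.0 1.1 1500268 95992 ? Tl 09:52 0:00 "
    "/usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5",
    "airflow 21648 0.0 1.1 467796 94628 ? S 09:52 0:00 "
    "/usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5",
    "root 25735 0.0 0.0 10472 920 pts/3 S+ 10:24 0:00 "
    "grep --color=auto scheduler",
]


def stopped_pids(lines):
    """Return PIDs whose STAT field starts with 'T' (stopped).

    Assumes `ps aux` layout:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    pids = []
    for line in lines:
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed row
        if fields[7].startswith("T"):
            pids.append(int(fields[1]))
    return pids


print(stopped_pids(PS_LINES))  # [21602]
```
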
Running py-spy on each process gives:
{code:java}
Collecting samples from 'pid: 21595' (python v3.6.8)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1
  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   5.00s     5.00s   _recv (multiprocessing/connection.py:379)
  0.00% 100.00%   0.000s    5.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    5.00s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    5.00s   end (airflow/executors/local_executor.py:233)
  0.00% 100.00%   0.000s    5.00s   <module> (airflow:32)
  0.00% 100.00%   0.000s    5.00s   recv (multiprocessing/connection.py:250)
  0.00% 100.00%   0.000s    5.00s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    5.00s   end (airflow/executors/local_executor.py:212)
  0.00% 100.00%   0.000s    5.00s   _callmethod (multiprocessing/managers.py:757)
  0.00% 100.00%   0.000s    5.00s   join (<string>:2)
  0.00% 100.00%   0.000s    5.00s   _recv_bytes (multiprocessing/connection.py:407)
  0.00% 100.00%   0.000s    5.00s   _execute_helper (airflow/jobs/scheduler_job.py:1463)
  0.00% 100.00%   0.000s    5.00s   run (airflow/jobs/base_job.py:213)
{code}
{code:java}
root@airflow-staging:/home/ubuntu# py-spy --pid 21602
Error: Failed to suspend process
Reason: EPERM: Operation not permitted{code}
{code:java}
Collecting samples from 'pid: 21648' (python v3.6.8)
Total Samples 28381
GIL: 0.00%, Active: 100.00%, Threads: 1
  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   283.8s    283.8s   _try_wait (subprocess.py:1424)
  0.00% 100.00%   0.000s    283.8s   call (subprocess.py:289)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:184)
  0.00% 100.00%   0.000s    283.8s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    283.8s   _bootstrap (multiprocessing/process.py:258)
  0.00% 100.00%   0.000s    283.8s   _execute_helper (airflow/jobs/scheduler_job.py:1347)
  0.00% 100.00%   0.000s    283.8s   execute_work (airflow/executors/local_executor.py:86)
  0.00% 100.00%   0.000s    283.8s   <module> (airflow:32)
  0.00% 100.00%   0.000s    283.8s   _launch (multiprocessing/popen_fork.py:73)
  0.00% 100.00%   0.000s    283.8s   run (airflow/jobs/base_job.py:213)
  0.00% 100.00%   0.000s    283.8s   check_call (subprocess.py:306)
  0.00% 100.00%   0.000s    283.8s   start (multiprocessing/process.py:105)
  0.00% 100.00%   0.000s    283.8s   run (airflow/executors/local_executor.py:116)
  0.00% 100.00%   0.000s    283.8s   wait (subprocess.py:1477)
  0.00% 100.00%   0.000s    283.8s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:277)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:223)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:224)
  0.00% 100.00%   0.000s    283.8s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    283.8s   __init__ (multiprocessing/popen_fork.py:19)
{code}
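Read together, the two traces suggest a wait chain: PID 21595 is blocked in a proxied `join()` call made from `LocalExecutor.end`, while PID 21648 is blocked in `subprocess.check_call` waiting for a child to exit. A minimal sketch of that pattern, assuming the `join()` is a queue join that only returns once every item has been marked done. It uses threads and a `threading.Event` as stand-ins for Airflow's worker processes and the child process, and `join_returns_within` and `demo` are hypothetical names, not Airflow code:

```python
import queue
import threading


def join_returns_within(q, timeout):
    """Call q.join() in a helper thread and report whether it returned in time."""
    t = threading.Thread(target=q.join, daemon=True)
    t.start()
    t.join(timeout)
    return not t.is_alive()


def demo():
    q = queue.Queue()
    child_exited = threading.Event()  # stand-in for the child process exiting

    def worker():
        q.get()
        # Stand-in for subprocess.check_call(): block until the child exits.
        child_exited.wait()
        q.task_done()

    threading.Thread(target=worker, daemon=True).start()
    q.put("task")

    # While the worker is stuck, join() cannot return.
    blocked = not join_returns_within(q, 0.5)

    # Once the "child" exits, the worker marks the task done and join() returns.
    child_exited.set()
    finished = join_returns_within(q, 5.0)
    return blocked, finished


print(demo())  # (True, True)
```

Until the stand-in for the child exits, `join()` never returns, which would be consistent with a scheduler that stops sending heartbeats without logging an error.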
We will try to downgrade to `1.10.3` first and see if this problem persists.
> scheduler gets stuck without a trace
> ------------------------------------
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
> Issue Type: Bug
> Components: executors, scheduler
> Affects Versions: 1.7.1.3
> Reporter: Nadeem Ahmed Nazeer
> Assignee: Bolke de Bruin
> Priority: Minor
> Labels: celery, kombu
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png,
> scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU
> usage of the scheduler service is at 100%. No jobs get submitted and
> everything comes to a halt. It looks like it goes into some kind of infinite
> loop.
> The only way I could make it run again is by manually restarting the
> scheduler service. But again, after running some tasks it gets stuck. I've
> tried both the Celery and Local executors, but the same issue occurs. I am
> using the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.