[ 
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932286#comment-16932286
 ] 

Neil Calabroso commented on AIRFLOW-401:
----------------------------------------

I'm currently experiencing this issue on `Ubuntu 14.04` with `python 3.6.8`. It 
started when we upgraded our staging environment from `1.10.1` to `1.10.4`. 
We're using `LocalExecutor`, and the process is managed by upstart.

I'm also seeing this warning in the Web UI: "The scheduler does not appear to be 
running. Last heartbeat was received 9 minutes ago."

For this sample, I got 3 stuck processes:

 
{code:java}
root@airflow-staging/home/ubuntu# ps aux | grep scheduler
airflow  21595  0.2  1.3 469868 109976 ?       S    09:52   0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21602  0.0  1.1 1500268 95992 ?       Tl   09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21648  0.0  1.1 467796 94628 ?        S    09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
root     25735  0.0  0.0  10472   920 pts/3    S+   10:24   0:00 grep --color=auto scheduler
{code}
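
The `STAT` column in the listing above is worth a second look: PID 21602 is in state `Tl` (stopped, multi-threaded), while the other two scheduler processes are in `S` (interruptible sleep). As a minimal sketch, assuming `ps aux`-style output where the state is the eighth column, a helper like the one below could flag stopped processes automatically. The name `find_stopped` and the `SAMPLE` text are illustrative only and are not part of Airflow.

```python
def find_stopped(ps_output):
    """Return (pid, stat, command) for each process whose STAT starts with 'T'."""
    stopped = []
    for line in ps_output.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat.startswith("T"):  # 'T' = stopped by a job-control signal
            stopped.append((pid, stat, command))
    return stopped


# Sample input copied from the listing above (hypothetical variable name).
SAMPLE = """\
airflow  21595  0.2  1.3 469868 109976 ?       S    09:52   0:04 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21602  0.0  1.1 1500268 95992 ?       Tl   09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
airflow  21648  0.0  1.1 467796 94628 ?        S    09:52   0:00 /usr/bin/python3.6 /usr/local/bin/airflow scheduler -n 5
"""

for pid, stat, command in find_stopped(SAMPLE):
    print(pid, stat, command)
```

On a live host, the same function could be fed the text returned by `subprocess.check_output(["ps", "aux"])` after decoding it.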
 

Running py-spy on each process gives:

 
{code:java}
Collecting samples from 'pid: 21595' (python v3.6.8)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    5.00s     5.00s   _recv (multiprocessing/connection.py:379)
  0.00% 100.00%   0.000s     5.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s     5.00s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s     5.00s   end (airflow/executors/local_executor.py:233)
  0.00% 100.00%   0.000s     5.00s   <module> (airflow:32)
  0.00% 100.00%   0.000s     5.00s   recv (multiprocessing/connection.py:250)
  0.00% 100.00%   0.000s     5.00s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s     5.00s   end (airflow/executors/local_executor.py:212)
  0.00% 100.00%   0.000s     5.00s   _callmethod (multiprocessing/managers.py:757)
  0.00% 100.00%   0.000s     5.00s   join (<string>:2)
  0.00% 100.00%   0.000s     5.00s   _recv_bytes (multiprocessing/connection.py:407)
  0.00% 100.00%   0.000s     5.00s   _execute_helper (airflow/jobs/scheduler_job.py:1463)
  0.00% 100.00%   0.000s     5.00s   run (airflow/jobs/base_job.py:213)
{code}
 
{code:java}
root@airflow-staging:/home/ubuntu# py-spy --pid 21602
Error: Failed to suspend process
Reason: EPERM: Operation not permitted
{code}
 
{code:java}
Collecting samples from 'pid: 21648' (python v3.6.8)
Total Samples 28381
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   283.8s    283.8s   _try_wait (subprocess.py:1424)
  0.00% 100.00%   0.000s    283.8s   call (subprocess.py:289)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:184)
  0.00% 100.00%   0.000s    283.8s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    283.8s   _bootstrap (multiprocessing/process.py:258)
  0.00% 100.00%   0.000s    283.8s   _execute_helper (airflow/jobs/scheduler_job.py:1347)
  0.00% 100.00%   0.000s    283.8s   execute_work (airflow/executors/local_executor.py:86)
  0.00% 100.00%   0.000s    283.8s   <module> (airflow:32)
  0.00% 100.00%   0.000s    283.8s   _launch (multiprocessing/popen_fork.py:73)
  0.00% 100.00%   0.000s    283.8s   run (airflow/jobs/base_job.py:213)
  0.00% 100.00%   0.000s    283.8s   check_call (subprocess.py:306)
  0.00% 100.00%   0.000s    283.8s   start (multiprocessing/process.py:105)
  0.00% 100.00%   0.000s    283.8s   run (airflow/executors/local_executor.py:116)
  0.00% 100.00%   0.000s    283.8s   wait (subprocess.py:1477)
  0.00% 100.00%   0.000s    283.8s   scheduler (airflow/bin/cli.py:1013)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:277)
  0.00% 100.00%   0.000s    283.8s   _Popen (multiprocessing/context.py:223)
  0.00% 100.00%   0.000s    283.8s   start (airflow/executors/local_executor.py:224)
  0.00% 100.00%   0.000s    283.8s   _execute (airflow/jobs/scheduler_job.py:1323)
  0.00% 100.00%   0.000s    283.8s   __init__ (multiprocessing/popen_fork.py:19)
{code}
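
Read together, the two readable traces seem consistent with a join-style wait: PID 21595 is inside `LocalExecutor.end()`, waiting on a `join` call made through a `multiprocessing` manager proxy, while PID 21648 is a worker waiting in `subprocess.check_call` for a child command to exit. This is only my reading of the stack frames, not a confirmed diagnosis. The sketch below illustrates just the general `join()`/`task_done()` contract, using `queue.Queue` and threads from the standard library rather than Airflow's multiprocessing queues; `join_completes`, `worker`, and `release` are hypothetical names.

```python
import queue
import threading


def join_completes(task_queue, timeout):
    """Return True if task_queue.join() returns within `timeout` seconds."""
    done = threading.Event()

    def waiter():
        task_queue.join()  # blocks until every item has been marked task_done()
        done.set()

    threading.Thread(target=waiter, daemon=True).start()
    return done.wait(timeout)


def worker(task_queue, release):
    task_queue.get()
    release.wait()          # stands in for a child command that has not exited yet
    task_queue.task_done()  # only now can a pending join() return


tasks = queue.Queue()
tasks.put("task-1")
release = threading.Event()
threading.Thread(target=worker, args=(tasks, release), daemon=True).start()

blocked = not join_completes(tasks, timeout=0.5)  # worker is still "running"
release.set()                                     # let the worker finish
finished = join_completes(tasks, timeout=5)
print(blocked, finished)
```

If the worker never reaches `task_done()`, for example because the command it launched never exits, any caller of `join()` waits indefinitely, which would match a scheduler that stops sending heartbeats.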
 

We will try to downgrade to `1.10.3` first and see if this problem persists.

 

> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, 
> scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU 
> usage of the scheduler service is at 100%. No jobs get submitted and everything 
> comes to a halt. It looks like it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the 
> scheduler service. But after running some tasks, it gets stuck again. I've 
> tried both the Celery and Local executors, but the same issue occurs. I am using 
> the -n 3 parameter when starting the scheduler. 
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
