[ 
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085816#comment-16085816
 ] 

Rick Otten commented on AIRFLOW-401:
------------------------------------

Using the LocalExecutor with 1.8rc2 on Ubuntu 16.04, we are still observing 
an issue that looks very similar to this one.  It appears that the scheduler 
spawns a fixed number of child processes (_parallelism = X_).  When a task 
runs, it consumes one of these child processes.  When the task finishes (we 
are mostly using the SSHExecuteOperator and the PostgresOperator in our 
tasks), the operating system marks the child process as *defunct*.  The 
Scheduler/LocalExecutor will not reuse that child process; it is used up.
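
For reference, a *defunct* (zombie) entry is what the operating system shows 
for a child that has exited but has not yet been waited on by its parent.  
The sketch below is not Airflow code; it is a minimal, Linux-only 
illustration (it reads /proc, and the helper names child_state and 
demo_zombie are made up for this example) of a child turning defunct and then 
disappearing once the parent calls waitpid.

```python
import os
import time


def child_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # Format is "pid (comm) state ..."; comm may contain spaces, so split
        # on the last closing parenthesis before taking the state field.
        return f.read().rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits at once, observe it as defunct, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers

    # Parent: poll until the exited child shows up as a zombie ("Z").
    deadline = time.time() + 5
    state = child_state(pid)
    while state != "Z" and time.time() < deadline:
        time.sleep(0.01)
        state = child_state(pid)

    # Reaping the child with waitpid removes the defunct entry.
    reaped_pid, _status = os.waitpid(pid, 0)
    still_listed = os.path.exists(f"/proc/{pid}")
    return state, reaped_pid == pid, still_listed


if __name__ == "__main__":
    print(demo_zombie())
```

A defunct child by itself only means the parent has not collected the exit 
status yet; whether that is what keeps the executor from handing out new work 
here is our guess from what we observe, not something this sketch proves.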

If there is a break in the tasks to be run, the scheduler restarts itself, 
which resets all of the defunct child processes back to a ready state.  
However, if you have a long-running task mixed with a bunch of short-running 
tasks, the short-running tasks will use up all of the available children.  The 
scheduler then queues all new jobs that come along until that one long-running 
task finishes and the scheduler can restart itself to clear the child pool.
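
To make the starvation pattern concrete, here is a toy model of the behavior 
as described above.  It is only a sketch of our observations, not Airflow's 
actual scheduling logic: the function simulate, its tick-based clock, and the 
rule "reset the pool only when nothing is running" are all assumptions made 
for illustration.

```python
def simulate(parallelism, tasks, max_ticks=1000):
    """Toy model: a finished task leaves its slot defunct until the pool is idle.

    tasks is a list of (arrival_tick, duration_ticks) tuples.
    Returns a dict mapping task index -> tick at which the task started.
    """
    free = parallelism   # slots that can accept a task
    defunct = 0          # slots used up by finished tasks
    running = []         # (task_index, end_tick) for tasks in progress
    pending = sorted(range(len(tasks)), key=lambda i: tasks[i][0])
    start_times = {}

    for tick in range(max_ticks):
        # Tasks that have finished leave their slot in the defunct state.
        still_running = []
        for idx, end in running:
            if end <= tick:
                defunct += 1
            else:
                still_running.append((idx, end))
        running = still_running

        # Assumed rule: the pool is only reset when nothing is running.
        if not running:
            free += defunct
            defunct = 0

        # Start every task that has arrived, as long as a free slot exists.
        waiting = []
        for i in pending:
            arrival, duration = tasks[i]
            if arrival <= tick and free > 0:
                free -= 1
                running.append((i, tick + duration))
                start_times[i] = tick
            else:
                waiting.append(i)
        pending = waiting

        if not pending and not running:
            break

    return start_times


if __name__ == "__main__":
    # One long task (10 ticks) plus short tasks arriving one per tick.
    print(simulate(3, [(0, 10), (0, 1), (1, 1), (2, 1), (3, 1)]))
```

With three slots, the long task and the first two short tasks start on 
arrival, but the short tasks arriving at ticks 2 and 3 are held until tick 10, 
when the long task ends and the pool is reset.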

We will try setting up Celery to run our tasks.  Hopefully that will help keep 
things running when they are expected to run.



> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, 
> scheduler_stuck_7hours.png, scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU 
> usage of the scheduler service is at 100%. No jobs get submitted and 
> everything comes to a halt. It looks like it goes into some kind of infinite 
> loop. 
> The only way I could make it run again is by manually restarting the 
> scheduler service. But again, after running some tasks it gets stuck. I've 
> tried with both the Celery and Local executors, but the same issue occurs. I 
> am using the -n 3 parameter when starting the scheduler. 
> Scheduler configs:
>
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
>
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
