[
https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829565#comment-16829565
]
Ash Berlin-Taylor commented on AIRFLOW-4424:
--------------------------------------------
Makes sense to fix this, but it would also be good to fix the underlying issue
and track down and tidy up/reap the defunct processes themselves.
Could you check your logs for any clue as to what the defunct processes are?
(Look for "spawning"/launching-type messages matching one of the recent
defunct pids in the scheduler logs.)
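To gather those pids, something like the following could list the current defunct processes on a node (a sketch only, not Airflow code; it assumes a procps-style `ps` is available):

```python
import subprocess

def defunct_pids():
    """Return the pids of defunct (zombie) processes, as reported by ps.

    Sketch only: trailing '=' in the format spec suppresses the header,
    and a STAT starting with 'Z' marks a zombie.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid=,stat="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = []
    for line in out.splitlines():
        pid, stat = line.split(None, 1)
        if stat.strip().startswith("Z"):
            pids.append(int(pid))
    return pids
```

Each pid this prints could then be grepped for in the scheduler logs to see which component spawned the process.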
> Scheduler does not terminate after num_runs when executor is
> KubernetesExecutor
> -------------------------------------------------------------------------------
>
> Key: AIRFLOW-4424
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
> Project: Apache Airflow
> Issue Type: Bug
> Components: kubernetes, scheduler
> Affects Versions: 1.10.3
> Environment: EKS, deployed with stable airflow helm chart
> Reporter: Brian Nutt
> Priority: Blocker
> Fix For: 1.10.3, 1.10.4
>
>
> When using an executor like the CeleryExecutor with num_runs set on the
> scheduler, the scheduler pod restarts after num_runs runs have completed.
> After switching to the KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] {kubernetes_executor.py:770} INFO - Shutting
> down Kubernetes executor
> However, the scheduler process does not exit, so the scheduler pod never
> restarts and never runs num_runs again. This forced a rollback to the
> CeleryExecutor, because with num_runs set to -1 the scheduler builds up
> defunct processes, which eventually prevents tasks from being scheduled
> once the underlying nodes run out of file descriptors.
>
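For context on how processes end up defunct: a child that exits before its parent calls waitpid() stays in the process table as a `<defunct>` entry until it is reaped. A minimal sketch (illustrative only, not the scheduler's actual code) of creating and then reaping such a zombie:

```python
import os

def spawn_and_reap():
    """Fork a child that exits at once, then reap it with waitpid().

    Between the child's exit and the parent's waitpid() call, the child
    would appear as <defunct> in ps output.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately; unreaped, it is a zombie
    # parent: collect the exit status, removing the zombie entry
    reaped_pid, status = os.waitpid(pid, 0)
    return reaped_pid == pid and os.WIFEXITED(status)
```

If the executor shutdown path skips this wait for some of its subprocesses, each completed run would leave one more `<defunct>` entry behind, which would match the file-descriptor exhaustion described above.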
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)