[ https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829578#comment-16829578 ]

Brian Nutt edited comment on AIRFLOW-4424 at 4/29/19 6:16 PM:
--------------------------------------------------------------

[~ash] Here are some of the logs I've found from the scheduler pod; one of the 
defunct PIDs appears in the scheduler logs.

PID 8612 is one of the scheduler's defunct child processes.

 

One other interesting thing to note: about 1,850 DB connections were 
invalidated.

 
 {"log":"[2019-04-26 13:51:29,522] {dag_processing.py:515} INFO - Launched DagFileProcessorManager with pid: 8612\n","stream":"stdout","time":"2019-04-26T13:51:29.522530422Z"}
 {"log":"[2019-04-26 13:51:29,525] {kubernetes_executor.py:450} INFO - Attempting to finish pod; pod_id: exampledagrunthis-c62f39ca5f654392b3a68e1c5c1fe50e; state: None; labels: {'airflow-worker': '217d9fa7-0a29-4126-b2e1-1a640478022a', 'dag_id': 'example_dag', 'execution_date': '2019-04-26T13_50_00_plus_00_00', 'task_id': 'run_this', 'try_number': '1'}\n","stream":"stdout","time":"2019-04-26T13:51:29.525893283Z"}
 {"log":"[2019-04-26 13:51:29,528] {settings.py:53} INFO - Configured default timezone \u003cTimezone [America/Los_Angeles]\u003e\n","stream":"stdout","time":"2019-04-26T13:51:29.528970108Z"}
 {"log":"[2019-04-26 13:51:29,539] {kubernetes_executor.py:573} INFO - Checking 2 task instances.\n","stream":"stdout","time":"2019-04-26T13:51:29.53977293Z"}
 {"log":"[2019-04-26 13:51:29,539] {kubernetes_executor.py:583} INFO - Found matching task example_dag-run_this (2019-04-26 13:50:00+00:00) with current state of success\n","stream":"stdout","time":"2019-04-26T13:51:29.539923359Z"}
 {"log":"[2019-04-26 13:51:29,539] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=8612\n","stream":"stdout","time":"2019-04-26T13:51:29.540097272Z"}
  



> Scheduler does not terminate after num_runs when executor is 
> KubernetesExecutor
> -------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4424
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: kubernetes, scheduler
>    Affects Versions: 1.10.3
>         Environment: EKS, deployed with stable airflow helm chart
>            Reporter: Brian Nutt
>            Priority: Blocker
>             Fix For: 1.10.4
>
>
> When using an executor like the CeleryExecutor with num_runs set on the 
> scheduler, the scheduler pod restarts after num_runs have completed. After 
> switching to the KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] {kubernetes_executor.py:770} INFO - Shutting 
> down Kubernetes executor
> However, the scheduler process never completes, so the scheduler pod never 
> restarts and runs num_runs again. This forced a rollback to the 
> CeleryExecutor, because with num_runs set to -1 the scheduler builds up 
> large numbers of defunct processes, which eventually prevents tasks from 
> being scheduled once the underlying nodes run out of file descriptors.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
