[
https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829578#comment-16829578
]
Ash Berlin-Taylor edited comment on AIRFLOW-4424 at 4/30/19 11:48 AM:
----------------------------------------------------------------------
[~ash] Here are some of the logs I've found from the scheduler pod, where one of
the defunct PIDs appears in the scheduler logs.
PID 8612 is one of the scheduler's defunct child processes.
One other interesting thing to note is that about 1,850 DB connections were
invalidated.
{noformat}
{"log":"[2019-04-26 13:51:29,522] {{dag_processing.py:515}} INFO - Launched
DagFileProcessorManager with pid:
8612\n","stream":"stdout","time":"2019-04-26T13:51:29.522530422Z"}
{"log":"[2019-04-26 13:51:29,525] {{kubernetes_executor.py:450}} INFO -
Attempting to finish pod; pod_id:
exampledagrunthis-c62f39ca5f654392b3a68e1c5c1fe50e; state: None; labels:
\{'airflow-worker': '217d9fa7-0a29-4126-b2e1-1a640478022a', 'dag_id':
'example_dag', 'execution_date': '2019-04-26T13_50_00_plus_00_00', 'task_id':
'run_this', 'try_number':
'1'}\n","stream":"stdout","time":"2019-04-26T13:51:29.525893283Z"}
{"log":"[2019-04-26 13:51:29,528] {{settings.py:53}} INFO - Configured default
timezone \u003cTimezone
[America/Los_Angeles]\u003e\n","stream":"stdout","time":"2019-04-26T13:51:29.528970108Z"}
{"log":"[2019-04-26 13:51:29,539] {{kubernetes_executor.py:573}} INFO -
Checking 2 task
instances.\n","stream":"stdout","time":"2019-04-26T13:51:29.53977293Z"}
{"log":"[2019-04-26 13:51:29,539] {{kubernetes_executor.py:583}} INFO - Found
matching task example_dag-run_this (2019-04-26 13:50:00+00:00) with current
state of success\n","stream":"stdout","time":"2019-04-26T13:51:29.539923359Z"}
{"log":"[2019-04-26 13:51:29,539] {{settings.py:182}} INFO -
settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800,
pid=8612\n","stream":"stdout","time":"2019-04-26T13:51:29.540097272Z"}
{noformat}
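For anyone trying to reproduce this, here is a minimal, Linux-only sketch of how I confirm which scheduler children are defunct by reading {{/proc/<pid>/stat}} (stdlib only; {{defunct_children}} is my own helper, not an Airflow API):

```python
import os

def defunct_children(parent_pid: int) -> list[int]:
    """Return PIDs of parent_pid's children that are defunct (zombie, state 'Z').

    Reads /proc/<pid>/stat; the comm field is parenthesised and may contain
    spaces, so split on the last ')' before taking the state and ppid fields.
    """
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Fields after the comm field: state, ppid, pgrp, ...
                fields = f.read().rsplit(")", 1)[1].split()
        except (FileNotFoundError, ProcessLookupError):
            continue  # process vanished while we were scanning
        state, ppid = fields[0], int(fields[1])
        if ppid == parent_pid and state == "Z":
            zombies.append(int(entry))
    return zombies
```

Running this against the scheduler's PID while the bug is active shows the zombie count growing, which matches what {{ps}} reports as {{<defunct>}} entries.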
> Scheduler does not terminate after num_runs when executor is
> KubernetesExecutor
> -------------------------------------------------------------------------------
>
> Key: AIRFLOW-4424
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
> Project: Apache Airflow
> Issue Type: Bug
> Components: kubernetes, scheduler
> Affects Versions: 1.10.3
> Environment: EKS, deployed with stable airflow helm chart
> Reporter: Brian Nutt
> Priority: Blocker
> Fix For: 1.10.4
>
>
> When using an executor such as the CeleryExecutor with num_runs set on the
> scheduler, the scheduler pod restarts after num_runs runs have completed.
> After switching to the KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] \{{kubernetes_executor.py:770}} INFO - Shutting
> down Kubernetes executor
> However, the scheduler process never exits, so the scheduler pod is never
> restarted to run another num_runs cycle. This forced a rollback to the
> CeleryExecutor, because with num_runs set to -1 the scheduler accumulates
> defunct processes, which eventually prevents tasks from being scheduled once
> the underlying nodes run out of file descriptors.
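> For context, the num_runs contract the deployment relies on can be sketched
> as follows (this is only an illustration of the semantics, -1 means loop
> forever and N means exit after N passes so the pod restarts; it is not
> Airflow's actual scheduler code, and {{harvest}} stands in for one
> scheduling pass):

```python
import itertools

def run_scheduler_loop(num_runs: int, harvest) -> int:
    """Run the scheduling loop num_runs times (forever if num_runs == -1).

    Returns the number of iterations executed; the process is expected to
    exit afterwards so Kubernetes restarts the scheduler pod.
    """
    iterations = itertools.count() if num_runs == -1 else range(num_runs)
    done = 0
    for _ in iterations:
        harvest()
        done += 1
    return done
```

> The bug reported here is that with the KubernetesExecutor the process hangs
> after this loop finishes, so the restart never happens.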
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)