[
https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829578#comment-16829578
]
Brian Nutt edited comment on AIRFLOW-4424 at 4/29/19 6:01 PM:
--------------------------------------------------------------
[~ash] Here are some of the logs I've found in the scheduler pod where one of
the defunct PIDs shows up in the scheduler logs.
PID 8612 appears as one of the scheduler's defunct child processes.
{noformat}
{"log":"Traceback (most recent call
last):\n","stream":"stderr","time":"2019-04-26T13:51:28.969860471Z"} {"log":"
File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line 258, in
_bootstrap\n","stream":"stderr","time":"2019-04-26T13:51:28.969862916Z"}
{"log":"
self.run()\n","stream":"stderr","time":"2019-04-26T13:51:28.969865226Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line 93,
in run\n","stream":"stderr","time":"2019-04-26T13:51:28.969867223Z"} {"log":"
self._target(*self._args,
**self._kwargs)\n","stream":"stderr","time":"2019-04-26T13:51:28.969869499Z"}
{"log":" File
\"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\",
line 563, in
helper\n","stream":"stderr","time":"2019-04-26T13:51:28.969871513Z"} {"log":"
processor_manager.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969873518Z"}
{"log":" File
\"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\",
line 796, in
start\n","stream":"stderr","time":"2019-04-26T13:51:28.969875327Z"} {"log":"
self.start_in_async()\n","stream":"stderr","time":"2019-04-26T13:51:28.969877469Z"}
{"log":" File
\"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\",
line 819, in
start_in_async\n","stream":"stderr","time":"2019-04-26T13:51:28.969879213Z"}
{"log":" simple_dags =
self.heartbeat()\n","stream":"stderr","time":"2019-04-26T13:51:28.969881187Z"}
{"log":" File
\"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\",
line 1208, in
heartbeat\n","stream":"stderr","time":"2019-04-26T13:51:28.96988305Z"} {"log":"
processor.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969885169Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/jobs.py\", line
435, in start\n","stream":"stderr","time":"2019-04-26T13:51:28.969886977Z"}
{"log":"
self._zombies)\n","stream":"stderr","time":"2019-04-26T13:51:28.969888991Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/jobs.py\", line
422, in
_launch_process\n","stream":"stderr","time":"2019-04-26T13:51:28.969890767Z"}
{"log":"
p.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969892715Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line
105, in start\n","stream":"stderr","time":"2019-04-26T13:51:28.969894479Z"}
{"log":" self._popen =
self._Popen(self)\n","stream":"stderr","time":"2019-04-26T13:51:28.969896484Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/context.py\", line
223, in _Popen\n","stream":"stderr","time":"2019-04-26T13:51:28.969898389Z"}
{"log":" return
_default_context.get_context().Process._Popen(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.969900517Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/context.py\", line
277, in _Popen\n","stream":"stderr","time":"2019-04-26T13:51:28.969902375Z"}
{"log":" return
Popen(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.96990443Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/popen_fork.py\", line
19, in __init__\n","stream":"stderr","time":"2019-04-26T13:51:28.969906219Z"}
{"log":"
self._launch(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.969908115Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/popen_fork.py\", line
66, in _launch\n","stream":"stderr","time":"2019-04-26T13:51:28.969909875Z"}
{"log":" self.pid =
os.fork()\n","stream":"stderr","time":"2019-04-26T13:51:28.969911766Z"}
{"log":"BlockingIOError: [Errno 11] Resource temporarily
unavailable\n","stream":"stderr","time":"2019-04-26T13:51:28.969913559Z"}
{"log":"[2019-04-26 13:51:29,522] dag_processing.py:515 INFO - Launched
DagFileProcessorManager with pid:
8612\n","stream":"stdout","time":"2019-04-26T13:51:29.522530422Z"}
{"log":"[2019-04-26 13:51:29,525] kubernetes_executor.py:450 INFO - Attempting
to finish pod; pod_id: exampledagrunthis-c62f39ca5f654392b3a68e1c5c1fe50e;
state: None; labels: {'airflow-worker': '217d9fa7-0a29-4126-b2e1-1a640478022a',
'dag_id': 'example_dag', 'execution_date': '2019-04-26T13_50_00_plus_00_00',
'task_id': 'run_this', 'try_number':
'1'}\n","stream":"stdout","time":"2019-04-26T13:51:29.525893283Z"}
{"log":"[2019-04-26 13:51:29,528] settings.py:53 INFO - Configured default
timezone \u003cTimezone
[America/Los_Angeles]\u003e\n","stream":"stdout","time":"2019-04-26T13:51:29.528970108Z"}
{"log":"[2019-04-26 13:51:29,539] kubernetes_executor.py:573 INFO - Checking 2
task instances.\n","stream":"stdout","time":"2019-04-26T13:51:29.53977293Z"}
{"log":"[2019-04-26 13:51:29,539] kubernetes_executor.py:583 INFO - Found
matching task example_dag-run_this (2019-04-26 13:50:00+00:00) with current
state of success\n","stream":"stdout","time":"2019-04-26T13:51:29.539923359Z"}
{"log":"[2019-04-26 13:51:29,539] settings.py:182 INFO -
settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800,
pid=8612\n","stream":"stdout","time":"2019-04-26T13:51:29.540097272Z"}{noformat}
> Scheduler does not terminate after num_runs when executor is
> KubernetesExecutor
> -------------------------------------------------------------------------------
>
> Key: AIRFLOW-4424
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
> Project: Apache Airflow
> Issue Type: Bug
> Components: kubernetes, scheduler
> Affects Versions: 1.10.3
> Environment: EKS, deployed with stable airflow helm chart
> Reporter: Brian Nutt
> Priority: Blocker
> Fix For: 1.10.4
>
>
> When using an executor such as the CeleryExecutor with num_runs set on the
> scheduler, the scheduler pod restarts after num_runs scheduler loops have
> completed. After switching to the KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] {kubernetes_executor.py:770} INFO - Shutting
> down Kubernetes executor
> However, the scheduler process never exits, so the scheduler pod is never
> restarted to run num_runs again. This forced a rollback to the
> CeleryExecutor, because with num_runs set to -1 the scheduler builds up
> large numbers of defunct processes, which eventually prevents tasks from
> being scheduled once the underlying nodes run out of file descriptors.
>
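
For context on the defunct-process buildup described above, here is a minimal standalone sketch (my own illustration, not Airflow code) of the underlying Unix behavior: a forked child that its parent never reaps with os.waitpid() remains <defunct> until the parent exits.
{code:python}
# Illustration only: a child that exits while its parent neither calls
# os.waitpid() nor exits itself shows up as <defunct> in `ps`. A scheduler
# that keeps forking DAG-file processors without reaping them accumulates
# zombies in exactly this way.
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)   # child exits immediately
else:
    # Parent keeps running without reaping; `ps -ef | grep defunct` now
    # shows the child as a zombie. Calling os.waitpid(pid, 0) would reap it.
    time.sleep(60)
{code}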
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)