[ https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829578#comment-16829578 ]

Brian Nutt edited comment on AIRFLOW-4424 at 4/29/19 6:01 PM:
--------------------------------------------------------------

[~ash] Here are some of the logs I've found from the scheduler pod where one of 
the defunct pids appears in the scheduler logs.

PID 8612 is found as one of the scheduler's defunct child processes.
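For context, a "defunct" entry is a child that has exited but has not yet been reaped by its parent via waitpid(). A minimal sketch (illustration only, not Airflow code) of how such zombies arise and are cleared:

```python
import os
import time

# Hypothetical illustration, not Airflow code: a forked child that exits
# before the parent calls waitpid() shows up as <defunct> in `ps` and keeps
# occupying a process-table slot until the parent reaps it.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately

time.sleep(0.2)  # child is now a zombie; `ps -ef` would show it as <defunct>

# Reaping the child clears the zombie entry from the process table.
reaped, status = os.waitpid(pid, 0)
print(reaped == pid and os.WIFEXITED(status))
```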

 
{noformat}
{"log":"Traceback (most recent call last):\n","stream":"stderr","time":"2019-04-26T13:51:28.969860471Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line 258, in _bootstrap\n","stream":"stderr","time":"2019-04-26T13:51:28.969862916Z"}
{"log":" self.run()\n","stream":"stderr","time":"2019-04-26T13:51:28.969865226Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line 93, in run\n","stream":"stderr","time":"2019-04-26T13:51:28.969867223Z"}
{"log":" self._target(*self._args, **self._kwargs)\n","stream":"stderr","time":"2019-04-26T13:51:28.969869499Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\", line 563, in helper\n","stream":"stderr","time":"2019-04-26T13:51:28.969871513Z"}
{"log":" processor_manager.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969873518Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\", line 796, in start\n","stream":"stderr","time":"2019-04-26T13:51:28.969875327Z"}
{"log":" self.start_in_async()\n","stream":"stderr","time":"2019-04-26T13:51:28.969877469Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\", line 819, in start_in_async\n","stream":"stderr","time":"2019-04-26T13:51:28.969879213Z"}
{"log":" simple_dags = self.heartbeat()\n","stream":"stderr","time":"2019-04-26T13:51:28.969881187Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/utils/dag_processing.py\", line 1208, in heartbeat\n","stream":"stderr","time":"2019-04-26T13:51:28.96988305Z"}
{"log":" processor.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969885169Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/jobs.py\", line 435, in start\n","stream":"stderr","time":"2019-04-26T13:51:28.969886977Z"}
{"log":" self._zombies)\n","stream":"stderr","time":"2019-04-26T13:51:28.969888991Z"}
{"log":" File \"/usr/local/lib/python3.6/site-packages/airflow/jobs.py\", line 422, in _launch_process\n","stream":"stderr","time":"2019-04-26T13:51:28.969890767Z"}
{"log":" p.start()\n","stream":"stderr","time":"2019-04-26T13:51:28.969892715Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/process.py\", line 105, in start\n","stream":"stderr","time":"2019-04-26T13:51:28.969894479Z"}
{"log":" self._popen = self._Popen(self)\n","stream":"stderr","time":"2019-04-26T13:51:28.969896484Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/context.py\", line 223, in _Popen\n","stream":"stderr","time":"2019-04-26T13:51:28.969898389Z"}
{"log":" return _default_context.get_context().Process._Popen(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.969900517Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/context.py\", line 277, in _Popen\n","stream":"stderr","time":"2019-04-26T13:51:28.969902375Z"}
{"log":" return Popen(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.96990443Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/popen_fork.py\", line 19, in __init__\n","stream":"stderr","time":"2019-04-26T13:51:28.969906219Z"}
{"log":" self._launch(process_obj)\n","stream":"stderr","time":"2019-04-26T13:51:28.969908115Z"}
{"log":" File \"/usr/local/lib/python3.6/multiprocessing/popen_fork.py\", line 66, in _launch\n","stream":"stderr","time":"2019-04-26T13:51:28.969909875Z"}
{"log":" self.pid = os.fork()\n","stream":"stderr","time":"2019-04-26T13:51:28.969911766Z"}
{"log":"BlockingIOError: [Errno 11] Resource temporarily unavailable\n","stream":"stderr","time":"2019-04-26T13:51:28.969913559Z"}

{"log":"[2019-04-26 13:51:29,522] dag_processing.py:515 INFO - Launched DagFileProcessorManager with pid: 8612\n","stream":"stdout","time":"2019-04-26T13:51:29.522530422Z"}
{"log":"[2019-04-26 13:51:29,525] kubernetes_executor.py:450 INFO - Attempting to finish pod; pod_id: exampledagrunthis-c62f39ca5f654392b3a68e1c5c1fe50e; state: None; labels: {'airflow-worker': '217d9fa7-0a29-4126-b2e1-1a640478022a', 'dag_id': 'example_dag', 'execution_date': '2019-04-26T13_50_00_plus_00_00', 'task_id': 'run_this', 'try_number': '1'}\n","stream":"stdout","time":"2019-04-26T13:51:29.525893283Z"}
{"log":"[2019-04-26 13:51:29,528] settings.py:53 INFO - Configured default timezone \u003cTimezone [America/Los_Angeles]\u003e\n","stream":"stdout","time":"2019-04-26T13:51:29.528970108Z"}
{"log":"[2019-04-26 13:51:29,539] kubernetes_executor.py:573 INFO - Checking 2 task instances.\n","stream":"stdout","time":"2019-04-26T13:51:29.53977293Z"}
{"log":"[2019-04-26 13:51:29,539] kubernetes_executor.py:583 INFO - Found matching task example_dag-run_this (2019-04-26 13:50:00+00:00) with current state of success\n","stream":"stdout","time":"2019-04-26T13:51:29.539923359Z"}
{"log":"[2019-04-26 13:51:29,539] settings.py:182 INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=8612\n","stream":"stdout","time":"2019-04-26T13:51:29.540097272Z"}
{noformat}
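The BlockingIOError at the bottom of the traceback is Python 3's OSError subclass for EAGAIN: os.fork() surfaces it when the kernel refuses to create another process, e.g. when the per-user process limit is hit, and unreaped defunct children still count against that limit. A quick sanity check of the errno mapping:

```python
import errno

# Errno 11 on Linux is EAGAIN ("Resource temporarily unavailable"), which
# os.fork() raises as BlockingIOError when no more processes can be created;
# defunct (unreaped) children still count toward the process limit.
print(errno.errorcode[errno.EAGAIN])
```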

  
  


> Scheduler does not terminate after num_runs when executor is KubernetesExecutor
> -------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4424
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: kubernetes, scheduler
>    Affects Versions: 1.10.3
>         Environment: EKS, deployed with stable airflow helm chart
>            Reporter: Brian Nutt
>            Priority: Blocker
>             Fix For: 1.10.4
>
>
> When using an executor like the CeleryExecutor with num_runs set on the 
> scheduler, the scheduler pod restarts after num_runs have completed. After 
> switching to the KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] {kubernetes_executor.py:770} INFO - Shutting 
> down Kubernetes executor
> However, the scheduler process never completes, so the scheduler pod never 
> restarts and never runs num_runs again. This forced a rollback to the 
> CeleryExecutor, because with num_runs set to -1 the scheduler accumulates 
> defunct processes, eventually preventing tasks from being scheduled once the 
> underlying nodes run out of file descriptors.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
