mrpowerus opened a new issue #14974:
URL: https://github.com/apache/airflow/issues/14974
**Apache Airflow version**: 2.0.0 and 2.0.1
**Kubernetes version (if you are using kubernetes)** (use `kubectl
version`): 1.18.4 (AKS)
**Environment**:
- **Cloud provider or hardware configuration**: Azure Cloud
- **OS** (e.g. from /etc/os-release): Debian GNU/Linux 10 (Buster)
- **Kernel** (e.g. `uname -a`): Linux airflow-scheduler-5cf464667c-7zd6j
5.4.0-1040-azure #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021 x86_64
GNU/Linux
- **Others**: Image apache/airflow:2.0.1
**What happened**:
The KubernetesJobWatcher does not delete worker Pods after they reach
_'status.phase=Succeeded'_. This only happens after roughly 30 minutes of
complete inactivity in the Kubernetes cluster.
**What you expected to happen**:
The KubernetesJobWatcher should delete worker Pods after they succeed, at any
time, as my config states (I verified this with `airflow config`):
```
[kubernetes]
pod_template_file = /opt/airflow/pod_template_file.yaml
worker_container_repository = apache/airflow
worker_container_tag = 2.0.1-python3.8
namespace = {{ .Values.role.namespace }}
delete_worker_pods = True
delete_worker_pods_on_failure = False
```
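For anyone triaging: the relevant option can be checked directly with `airflow config get-value` (values shown are what I would expect from the config above, not captured output):

```shell
# Confirm the scheduler actually sees the cleanup setting
airflow config get-value kubernetes delete_worker_pods
# expected: True
airflow config get-value kubernetes delete_worker_pods_on_failure
# expected: False
```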
**What do you think went wrong?**
The Executor tries over and over again to adopt completed pods in:
https://github.com/apache/airflow/blob/master/airflow/executors/kubernetes_executor.py#L645
This adoption succeeds. However, the Pods are not cleaned up by the
KubernetesJobWatcher, as no logging from the watcher appears. (I would expect
logging from this line:
https://github.com/apache/airflow/blob/v2-0-stable/airflow/executors/kubernetes_executor.py#L147)
After some digging, I believe the `watch.stream()` from `from kubernetes import
client, watch`, which is called in
https://github.com/apache/airflow/blob/v2-0-stable/airflow/executors/kubernetes_executor.py#L143,
dies after a long period of complete inactivity. This is also explicitly
mentioned in the docstring at:
https://github.com/kubernetes-client/python-base/blob/master/watch/watch.py#L115
I think Airflow should be able to recover from this automatically. Otherwise I
would have to run a dummy task every 30 minutes or so just to keep the
`kubernetes.watch.stream()` alive, which seems pointless.
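To illustrate the recovery I have in mind: wrap the (possibly dying) stream in a loop that reconnects and resumes from the last seen `resourceVersion`. This is only a sketch of the pattern, not Airflow's actual code; `stream_factory` stands in for `watch.Watch().stream(..., resource_version=..., timeout_seconds=...)`, and the fake stream below simulates a watch that dies after a couple of events:

```python
def resilient_watch(stream_factory, handle_event, max_restarts=3):
    """Re-open the watch stream whenever it ends (e.g. idle timeout),
    resuming from the last seen resourceVersion."""
    resource_version = "0"
    for _ in range(max_restarts):
        # In the real watcher this would be:
        #   watch.Watch().stream(kube_client.list_namespaced_pod, ...,
        #                        resource_version=resource_version)
        for event in stream_factory(resource_version):
            resource_version = event["resource_version"]
            handle_event(event)
    return resource_version


def fake_stream(resource_version):
    """Simulated watch stream that dies after yielding two events."""
    start = int(resource_version)
    for i in range(start + 1, start + 3):
        yield {"resource_version": str(i), "phase": "Succeeded"}


seen = []
last_rv = resilient_watch(fake_stream, seen.append)
# The loop survives three stream deaths and processes all six events.
```

With this pattern a server-side watch timeout just triggers a reconnect instead of silently stopping pod cleanup.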
**How to reproduce it**:
Run Airflow 2+ in a Kubernetes cluster that has no activity at all for roughly
30 minutes, then start an operator. The Kubernetes worker Pod will not be
deleted.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]