mrpowerus opened a new issue #14974:
URL: https://github.com/apache/airflow/issues/14974


   **Apache Airflow version**: 2.0.0 and 2.0.1
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`): 1.18.4 (AKS)
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Azure Cloud
   - **OS** (e.g. from /etc/os-release): Debian GNU/Linux 10 (Buster)
   - **Kernel** (e.g. `uname -a`): Linux airflow-scheduler-5cf464667c-7zd6j 
5.4.0-1040-azure #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021 x86_64 
GNU/Linux
   - **Others**: Image apache/airflow:2.0.1
   
   **What happened**:
   
The KubernetesJobWatcher does not delete worker pods after they reach 
_'status.phase=Succeeded'_. This only happens after roughly 30 minutes of 
complete inactivity of the Kubernetes cluster.
   
   **What you expected to happen**:
   
The KubernetesJobWatcher should delete worker pods whenever they succeed, 
as my config specifies (I verified this with `airflow config`):
   ```
       [kubernetes]
       pod_template_file = /opt/airflow/pod_template_file.yaml
       worker_container_repository = apache/airflow
       worker_container_tag = 2.0.1-python3.8
       namespace = {{ .Values.role.namespace }}
       delete_worker_pods = True
       delete_worker_pods_on_failure = False
   
   ```
   <!-- What do you think went wrong? -->
   
The executor tries over and over again to adopt completed pods in:
   
https://github.com/apache/airflow/blob/master/airflow/executors/kubernetes_executor.py#L645
   This succeeds. However, the pods are not cleaned up by the 
KubernetesJobWatcher, and no logging from the watcher appears. (I would 
expect logging from this line: 
https://github.com/apache/airflow/blob/v2-0-stable/airflow/executors/kubernetes_executor.py#L147)
   
   After some digging, I think the watch.stream() from `from kubernetes import 
client, watch`, which is called in 
https://github.com/apache/airflow/blob/v2-0-stable/airflow/executors/kubernetes_executor.py#L143, 
dies after a long period of complete inactivity. This is also explicitly 
mentioned in the docstring at: 
https://github.com/kubernetes-client/python-base/blob/master/watch/watch.py#L115
   
   I think Airflow should be able to recover from this issue automatically. 
Otherwise I would have to run a dummy task every 30 minutes or so just to keep 
the kubernetes.watch.stream() alive, which seems pointless.
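   The kind of automatic recovery I have in mind can be sketched as a restart 
loop around the watch generator. This is a hypothetical illustration, not 
Airflow's actual code: `flaky_stream` is a stand-in for 
`kubernetes.watch.Watch().stream()` that simulates a connection dropping 
silently, and `resilient_watch` re-opens it from the last seen resourceVersion:

   ```python
   # Hypothetical sketch of watch-stream recovery; all names are invented
   # for illustration and do not exist in Airflow or the kubernetes client.

   ALL_EVENTS = [{"resource_version": rv, "phase": phase}
                 for rv, phase in [(1, "Pending"), (2, "Running"), (3, "Succeeded")]]

   def flaky_stream(since_rv):
       """Stand-in for kubernetes.watch.Watch().stream(): yields at most two
       events newer than `since_rv`, then the connection drops silently."""
       pending = [e for e in ALL_EVENTS if e["resource_version"] > since_rv]
       yield from pending[:2]

   def resilient_watch(open_stream, max_restarts=5):
       """Re-open the stream from the last seen resourceVersion whenever it
       ends, instead of assuming a single stream lives forever."""
       last_rv, events = 0, []
       for _ in range(max_restarts):
           progressed = False
           for event in open_stream(last_rv):
               events.append(event)
               last_rv = event["resource_version"]
               progressed = True
           if not progressed:
               break  # stream fully drained (or genuinely idle); stop retrying
       return events

   print([e["phase"] for e in resilient_watch(flaky_stream)])
   # → ['Pending', 'Running', 'Succeeded']
   ```

   With a loop like this, a watch connection that expires during an idle period 
would simply be re-established on the next pass rather than silently ending the 
watcher.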
   
   **How to reproduce it**:
   Run Airflow 2+ in a Kubernetes cluster that has no activity at all for 
roughly 30 minutes, then start an operator. The worker pod will not be 
deleted.

