afusr created AIRFLOW-6014:
------------------------------
Summary: Kubernetes executor - handle preempted deleted pods - queued tasks
Key: AIRFLOW-6014
URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
Project: Apache Airflow
Issue Type: Improvement
Components: executor-kubernetes
Affects Versions: 1.10.6
Reporter: afusr
Assignee: Daniel Imberman
We have encountered an issue where, when using the Kubernetes executor with
autoscaling, Airflow pods are preempted and Airflow never attempts to rerun
these pods.
This is partly a result of having the following set on the pod spec:
restartPolicy: Never
This setting makes sense: if a pod fails while running a task, we don't want
Kubernetes to retry it, as retries should be controlled by Airflow.
What we believe happens is that when a new node is added by autoscaling,
Kubernetes schedules a number of Airflow pods onto the new node, as well as any
pods required by k8s/daemon sets. As these are higher priority, the Airflow
pods are preempted and deleted. You see messages such as:
Preempted by kube-system/ip-masq-agent-xz77q on node
gke-some--airflow-00000000-node-1ltl
Within the Kubernetes executor, these pods end up in a status of Pending, and
an event of type DELETED is received but not handled.
The end result is that the tasks remain in a queued state forever.
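One possible way to handle this case is sketched below. It is only an
illustration of the idea, not the executor's actual code: the function
classify_pod_event and the outcome names REQUEUE, FAILED and SUCCESS are
hypothetical. It assumes the executor sees the watch event type (ADDED,
MODIFIED, DELETED) together with the pod phase (Pending, Running, Succeeded,
Failed) and treats a DELETED event on a Pending pod as "the task never ran".

```python
from typing import Optional

# Hypothetical outcome names, for illustration only; these are not
# constants taken from Airflow.
REQUEUE = "requeue"
FAILED = "failed"
SUCCESS = "success"


def classify_pod_event(event_type: str, pod_phase: str) -> Optional[str]:
    """Map a Kubernetes watch event for a task pod to a task outcome.

    Returns None when the event requires no action yet.
    """
    if event_type == "DELETED" and pod_phase == "Pending":
        # The pod was removed (for example, preempted) before it started,
        # so the task never ran and can be handed back for scheduling.
        return REQUEUE
    if pod_phase == "Failed":
        return FAILED
    if pod_phase == "Succeeded":
        return SUCCESS
    # Pending or Running without a deletion: keep waiting.
    return None
```

With this mapping, a preempted pod would produce REQUEUE instead of being
ignored, so the task would not sit in the queued state indefinitely. Whether
the task should be requeued or marked failed (and left to Airflow's own retry
settings) is a design choice this sketch does not settle.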
--
This message was sent by Atlassian Jira
(v8.3.4#803005)