afusr created AIRFLOW-6014:
------------------------------

             Summary: Kubernetes executor - handle preempted deleted pods - 
queued tasks
                 Key: AIRFLOW-6014
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
             Project: Apache Airflow
          Issue Type: Improvement
          Components: executor-kubernetes
    Affects Versions: 1.10.6
            Reporter: afusr
            Assignee: Daniel Imberman


We have encountered an issue when using the Kubernetes executor together with 
cluster autoscaling: Airflow pods are preempted, and Airflow never attempts to 
relaunch them. 

This is partly as a result of having the following set on the pod spec:

restartPolicy: Never

This setting makes sense: if a pod fails while running a task, we don't want 
Kubernetes to retry it, because retries should be controlled by Airflow. 

What we believe happens is that when a new node is added by autoscaling, 
Kubernetes schedules a number of Airflow pods onto the new node, as well as any 
pods required by Kubernetes itself or by daemon sets. Because those system pods 
have higher priority, the Airflow pods are preempted and deleted. You see 
messages such as:

Preempted by kube-system/ip-masq-agent-xz77q on node 
gke-some--airflow-00000000-node-1ltl

Within the Kubernetes executor, these pods end up with a status of Pending, and 
a DELETED event is received but not handled. 

The end result is that the tasks remain in a queued state forever. 
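
To make the missing case concrete, here is a minimal sketch of the decision 
the executor's pod watcher would need to make. It is not Airflow's actual 
code: the function name and the "requeue"/"ignore" action names are 
hypothetical, and it only assumes the standard Kubernetes watch event types 
(ADDED, MODIFIED, DELETED) and pod phases (Pending, Running, Succeeded, 
Failed, Unknown).

```python
# Hypothetical action names for illustration; these are not Airflow identifiers.
REQUEUE = "requeue"
IGNORE = "ignore"


def action_for_pod_event(event_type: str, phase: str) -> str:
    """Decide how to react to a single pod watch event.

    event_type: Kubernetes watch event type ("ADDED", "MODIFIED", "DELETED").
    phase:      pod phase reported with the event ("Pending", "Running", ...).
    """
    # A pod that is deleted while still Pending never ran its task, so the
    # task should be handed back to the scheduler instead of staying queued.
    if event_type == "DELETED" and phase == "Pending":
        return REQUEUE
    # Every other combination is left to whatever handling already exists.
    return IGNORE


if __name__ == "__main__":
    print(action_for_pod_event("DELETED", "Pending"))
    print(action_for_pod_event("MODIFIED", "Pending"))
```

How a "requeue" result would be turned into a task state change is left open 
here; that depends on how the executor reports state back to the scheduler.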

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
