[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977392#comment-16977392
 ] 

afusr commented on AIRFLOW-6014:
--------------------------------

The following PR has been raised as a temporary fix for this scenario: 
[https://github.com/apache/airflow/pull/6606] 

It sets the state of the task to UP_FOR_RESCHEDULE. In testing, this results 
in the pod being rescheduled without affecting the retry count for the task. 
This is the desired outcome: a pod that is deleted while still in a Pending 
state never transitioned to Running, so the task has not yet had a chance to 
run. 
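
As a rough illustration of the rule described above, here is a minimal 
sketch in Python. It is not the code from the PR: the function name 
state_for_pod_event and the string constants are hypothetical stand-ins 
defined locally, and only the "DELETED" watch event type and the "Pending" 
and "Failed" pod phases are taken from Kubernetes itself. 

```python
from typing import Optional

# Hypothetical stand-ins for task states, defined locally for illustration;
# they are not imported from Airflow.
UP_FOR_RESCHEDULE = "up_for_reschedule"
FAILED = "failed"


def state_for_pod_event(event_type: str, pod_phase: str) -> Optional[str]:
    """Map a Kubernetes watch event to a new task state, or None for no change."""
    if event_type == "DELETED" and pod_phase == "Pending":
        # The pod was removed (for example, preempted) before it ever ran,
        # so reschedule the task without consuming one of its retries.
        return UP_FOR_RESCHEDULE
    if pod_phase == "Failed":
        # The pod actually ran and failed; normal retry handling applies.
        return FAILED
    return None


if __name__ == "__main__":
    print(state_for_pod_event("DELETED", "Pending"))
```

Keeping the Pending-and-deleted case separate from the Failed case is what 
leaves the retry count untouched for pods that never started. 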

> Kubernetes executor - handle preempted deleted pods - queued tasks
> ------------------------------------------------------------------
>
>                 Key: AIRFLOW-6014
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: executor-kubernetes
>    Affects Versions: 1.10.6
>            Reporter: afusr
>            Assignee: Daniel Imberman
>            Priority: Minor
>
> We have encountered an issue whereby when using the kubernetes executor, and 
> using autoscaling, airflow pods are preempted and airflow never attempts to 
> rerun these pods. 
> This is partly as a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense as if a pod fails when running a task, we don't want 
> kubernetes to retry it, as this should be controlled by airflow. 
> What we believe happens is that when a new node is added by autoscaling, 
> kubernetes schedules a number of airflow pods onto the new node, as well as 
> any pods required by k8s/daemon sets. As these are higher priority, the 
> Airflow pods are preempted, and deleted. You see messages such as:
>  
> Preempted by kube-system/ip-masq-agent-xz77q on node 
> gke-some--airflow-00000000-node-1ltl
>  
> Within the kubernetes executor, these pods end up in a status of pending and 
> an event of deleted is received but not handled. 
> The end result is tasks remain in a queued state forever. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
