[
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978151#comment-16978151
]
afusr commented on AIRFLOW-6014:
--------------------------------
I've created a PR which should catch these deleted pods and mark them as up
for reschedule.
We have looked at taints; it seems the ones applied to the k8s node by the GKE
autoscaler when the node spins up don't prevent Airflow pods from being
scheduled there before all system pods have started.
You could perhaps create some kind of watch process to look for newly created
nodes, apply a taint, and wait for the system pods to start. You would then
have to ensure any system pods you want on the node have a toleration added to
their spec so they are able to start. Once the system pods are up, you could
remove the taint and allow Airflow pods to be placed there.
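As a rough sketch of that workaround (the taint key `startup-pending` and the kubectl commands in the comments are illustrative assumptions, not anything GKE or Airflow actually provides):

```yaml
# Hypothetical taint the watch process would apply to a newly created node
# (key/value are assumptions chosen for illustration):
#   kubectl taint nodes <new-node> startup-pending=true:NoSchedule
#
# Toleration that would need to be added to each system pod's spec so it can
# still be scheduled onto the tainted node:
tolerations:
  - key: "startup-pending"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
# Once the system pods are Running, the watch process removes the taint,
# allowing Airflow pods onto the node:
#   kubectl taint nodes <new-node> startup-pending:NoSchedule-
```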
It's interesting to consider why k8s creates a state where this can happen in
the first place. My guess is that whilst the new node is starting, multiple
Airflow tasks back up and are waiting to be scheduled. Once the node is ready,
the k8s scheduler selects a number of Airflow pods and, looking at the pod
memory request value, decides they will all fit on the new node. Then perhaps
it also tries to schedule any daemon sets on there; as these must be present
and have a higher priority, they force a random Airflow pod to be preempted,
and it is then deleted from the node.
There is a similar issue described in this openshift bug report, particularly
this comment [https://bugzilla.redhat.com/show_bug.cgi?id=1701046#c13]
The most straightforward approach, I think, is to just ensure that if a pod is
pending and it is then deleted, it is marked as up for reschedule, as the
linked PR should do. Airflow then appears (from testing) to relaunch the pod
without affecting the retry limit for the task.
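The check described above can be sketched as follows; the function name and the literal event/phase strings are illustrative of the idea, not the actual KubernetesExecutor internals:

```python
# Illustrative sketch of the "deleted while pending" check; the name
# should_reschedule is hypothetical, not an actual Airflow function.

def should_reschedule(event_type: str, pod_phase: str) -> bool:
    """Return True when a pod was deleted while still Pending.

    A pod deleted before it ever ran (e.g. preempted by a daemon-set pod
    on a freshly autoscaled node) should be put back up for reschedule
    rather than marked failed, so the task's retry limit is not consumed.
    """
    return event_type == "DELETED" and pod_phase == "Pending"


# A pod deleted while still pending is rescheduled:
print(should_reschedule("DELETED", "Pending"))    # True
# A pod deleted after completing is handled as before:
print(should_reschedule("DELETED", "Succeeded"))  # False
# An ordinary status update on a pending pod is not a reschedule:
print(should_reschedule("MODIFIED", "Pending"))   # False
```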
> Kubernetes executor - handle preempted deleted pods - queued tasks
> ------------------------------------------------------------------
>
> Key: AIRFLOW-6014
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
> Project: Apache Airflow
> Issue Type: Improvement
> Components: executor-kubernetes
> Affects Versions: 1.10.6
> Reporter: afusr
> Assignee: Daniel Imberman
> Priority: Minor
>
> We have encountered an issue whereby, when using the Kubernetes executor
> with autoscaling, Airflow pods are preempted and Airflow never attempts to
> rerun them.
> This is partly as a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense, as if a pod fails when running a task, we don't want
> Kubernetes to retry it; that should be controlled by Airflow.
> What we believe happens is that when a new node is added by autoscaling,
> Kubernetes schedules a number of Airflow pods onto the new node, as well as
> any pods required by k8s/daemon sets. As these are higher priority, the
> Airflow pods are preempted and deleted. You see messages such as:
>
> Preempted by kube-system/ip-masq-agent-xz77q on node
> gke-some--airflow-00000000-node-1ltl
>
> Within the Kubernetes executor, these pods end up in a status of Pending,
> and a deleted event is received but not handled.
> The end result is that tasks remain in a queued state forever.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)