[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176404#comment-17176404 ] Lindsay Portelli commented on AIRFLOW-6014:

I am facing the same issue as [~kiruthiga24]. I was able to mitigate the failed pods by increasing the number of retries, but I cannot figure out how best to clear the tasks stuck in the queued state.

> Kubernetes executor - handle preempted deleted pods - queued tasks
>
> Key: AIRFLOW-6014
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
> Project: Apache Airflow
> Issue Type: Improvement
> Components: executor-kubernetes
> Affects Versions: 1.10.6
> Reporter: afusr
> Assignee: Daniel Imberman
> Priority: Minor
> Fix For: 1.10.10
> Attachments: image-2020-07-14-11-27-21-277.png, image-2020-07-14-11-29-14-334.png
>
> We have encountered an issue whereby, when using the Kubernetes executor with autoscaling, Airflow pods are preempted and Airflow never attempts to rerun them.
>
> This is partly a result of the following being set on the pod spec:
>
> restartPolicy: Never
>
> This makes sense: if a pod fails while running a task, we don't want Kubernetes to retry it, as retries should be controlled by Airflow.
>
> What we believe happens is that when a new node is added by autoscaling, Kubernetes schedules a number of Airflow pods onto the new node, as well as any pods required by k8s/daemon sets. As these have higher priority, the Airflow pods are preempted and deleted. You see messages such as:
>
> Preempted by kube-system/ip-masq-agent-xz77q on node gke-some--airflow--node-1ltl
>
> Within the Kubernetes executor, these pods end up in a status of pending; a deleted event is received but not handled.
>
> The end result is that tasks remain in a queued state forever.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
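The failure mode described above can be modeled in a few lines. This is a simplified illustrative sketch of the behavior addressed by PR #6606, not the actual KubernetesExecutor code; the function name, state strings, and flag are assumptions:

```python
# Simplified model (not real Airflow code) of the executor watcher decision.
# The bug: a DELETED event for a pod still in phase "Pending" was ignored,
# so the corresponding task stayed queued forever.
def resolve_task_state(pod_phase: str, event_type: str,
                       handle_pending_deleted: bool) -> str:
    if event_type == "DELETED" and pod_phase == "Pending":
        # Fixed behavior: a pending pod that vanishes (e.g. preempted by a
        # DaemonSet on a freshly autoscaled node) is handed back to the
        # scheduler so the task can be relaunched.
        return "failed" if handle_pending_deleted else "queued"
    if pod_phase == "Succeeded":
        return "success"
    if pod_phase == "Failed":
        return "failed"
    return "queued"

# Before the fix the preempted pod's task is stuck:
assert resolve_task_state("Pending", "DELETED", handle_pending_deleted=False) == "queued"
# After the fix it is resolved so Airflow can reschedule it:
assert resolve_task_state("Pending", "DELETED", handle_pending_deleted=True) == "failed"
```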
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157168#comment-17157168 ] Kiruthiga commented on AIRFLOW-6014:

I am facing the same issue. Task pods that are preempted by Kubernetes to accommodate critical system pods are marked as "queued" or "failed" in Airflow. I am concentrating on the queued tasks, as I am not sure why a task fails on preemption. In my case, the Kubernetes scheduler marks the task (preempted pod) for the "up_for_reschedule" state, but this is not reflected in the Airflow database/webserver UI. Attaching screenshots for reference.

*Kubernetes Scheduler Log*
!image-2020-07-14-11-27-21-277.png!

*Airflow Webserver* - the task *sleep_for_1* is still in the queued state (expected state is "up_for_reschedule")
!image-2020-07-14-11-29-14-334.png!

I have started debugging the Airflow code. The mentioned log (screenshot 1, from the Kubernetes scheduler) comes from the *_process_executor_events* method in airflow/jobs/*scheduler_job*.py. I suspect the *State.UP_FOR_RESCHEDULE* state is not handled in this method. Please correct me if my understanding is wrong, and help me fix this issue.
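Kiruthiga's hypothesis above, that an event state not handled by `_process_executor_events` leaves the task instance unchanged, can be illustrated with a toy event loop. Everything here is an assumption for illustration, not Airflow's actual scheduler code:

```python
# Toy model (not real Airflow code): a scheduler only updates the recorded
# task state for event states it explicitly handles, so an unhandled state
# leaves the database showing the stale "queued" value.
HANDLED_STATES = {"success", "failed"}  # assumed set; "up_for_reschedule" missing

def process_executor_event(recorded_state: str, event_state: str) -> str:
    if event_state in HANDLED_STATES:
        return event_state
    # Unhandled event states are silently dropped, mirroring the symptom in
    # the screenshots: the UI keeps showing "queued".
    return recorded_state

assert process_executor_event("queued", "failed") == "failed"
assert process_executor_event("queued", "up_for_reschedule") == "queued"  # stuck
```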
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064323#comment-17064323 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 17f0eb15ba1d2766eb673e1c846f3d278207cd0a in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=17f0eb1 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064142#comment-17064142 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 1ec4b7ad9c2d07188d0ed2b3882cdc434c625830 in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=1ec4b7a ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062519#comment-17062519 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 9de592b8048460cc84a84cbeb360356e2ac05b14 in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=9de592b ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062501#comment-17062501 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit f9177b0a55c67a9694699ff4e861186fa869201b in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=f9177b0 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062264#comment-17062264 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 5c2ebe2f87e08ff4cc8c2aba1c36c5411536b9d5 in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=5c2ebe2 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062221#comment-17062221 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 2ae99e145374655c87068bce48e91f07a6567242 in airflow's branch refs/heads/v1-10-test from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=2ae99e1 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061418#comment-17061418 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38 in airflow's branch refs/heads/master from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e626be ]

[AIRFLOW-6014] - handle pods which are preempted and deleted by kuber… (#6606)

* [AIRFLOW-6014] - handle pods which are preempted and deleted by kubernetes but not restarted
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061416#comment-17061416 ] ASF GitHub Bot commented on AIRFLOW-6014:

potiuk commented on pull request #6606: [AIRFLOW-6014] - handle pods which are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061417#comment-17061417 ] ASF subversion and git services commented on AIRFLOW-6014:

Commit 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38 in airflow's branch refs/heads/master from atrbgithub [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e626be ]

[AIRFLOW-6014] - handle pods which are preempted and deleted by kuber… (#6606)

* [AIRFLOW-6014] - handle pods which are preempted and deleted by kubernetes but not restarted
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060912#comment-17060912 ] ASF GitHub Bot commented on AIRFLOW-6014:

inytar commented on pull request #7611: [AIRFLOW-6014] reschedule deleted pending tasks
URL: https://github.com/apache/airflow/pull/7611
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048694#comment-17048694 ] ASF GitHub Bot commented on AIRFLOW-6014:

atrbgithub commented on pull request #6606: [AIRFLOW-6014] - handle pods which are preempted and deleted by kuber…netes but not restarted
URL: https://github.com/apache/airflow/pull/6606

Make sure you have checked _all_ steps below.

### Jira
- [x] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW-6014) issues and references them in the PR title.

### Description
- [x] Here are some details about my PR, including screenshots of any UI changes:
This PR addresses the issue where a pod is preempted during the creation phase and, because pods have ```restartPolicy: Never``` set in their spec, the pod is never restarted and ends up as a queued task within Airflow until the scheduler is restarted.

### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: Unsure if it is possible to simulate this scenario.

### Commits
- [x] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"

### Documentation
- [x] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
- If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release
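Until a build containing this fix is deployed, the retry-based mitigation mentioned earlier in the thread (increasing task retries) can absorb preemption-induced pod failures. A minimal sketch, with illustrative values that are assumptions, not from the ticket:

```python
from datetime import timedelta

# Illustrative Airflow task settings (assumed values, tune per workload):
# extra retries let a task survive a pod that failed because its node was
# preempted during autoscaling. With the KubernetesExecutor each retry
# launches a fresh pod, which may land on a stable node.
default_args = {
    "retries": 5,
    "retry_delay": timedelta(minutes=2),
}

# Note this only helps tasks that reach a "failed" state; tasks stuck in
# "queued" (the bug in this ticket) are never retried and need the fix itself.
```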
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028246#comment-17028246 ] ASF GitHub Bot commented on AIRFLOW-6014:

stale[bot] commented on pull request #6606: [AIRFLOW-6014] - handle pods which are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986672#comment-16986672 ] afusr commented on AIRFLOW-6014:

[~dimberman] Do you think this can be merged? This fix has saved us a number of times now, by catching pods which have failed to start and rescheduling them.
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978151#comment-16978151 ] afusr commented on AIRFLOW-6014: I've created a PR which should catch these deleted pods and mark them as up for reschedule. We have looked at taints; it seems the ones applied to the node by the GKE autoscaler when the node spins up don't prevent Airflow pods from being scheduled there before all system pods have started. You could perhaps create some kind of watch process to look for newly created nodes, apply a taint, and wait for the system pods to start. But you would then have to ensure that any system pods you want on the node have a toleration added to their spec so they are able to start. Once the system pods are up, you could remove the taint and allow Airflow pods to be placed there. It's interesting that k8s creates a state where this can happen in the first place. My guess is that while the new node is starting, multiple Airflow tasks back up and wait to be scheduled. Once the node is ready, the k8s scheduler selects a number of Airflow pods and, looking at their memory request values, decides they will all fit on the new node. It then also tries to schedule any DaemonSet pods there; since these must be present and have a higher priority, they force a random Airflow pod to be preempted and deleted from the node. There is a similar issue described in this OpenShift bug report, particularly this comment: [https://bugzilla.redhat.com/show_bug.cgi?id=1701046#c13] The most straightforward approach, I think, is simply to ensure that if a pod is pending and is then deleted, it is marked as up for reschedule, as the linked PR should do. Airflow then appears (from testing) to relaunch the pod without affecting the retry limit for the task.
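The watch-process idea above could be sketched roughly as follows. This is a hypothetical illustration, not part of any Airflow release: the taint key `airflow/node-initializing`, the helper names, and the patch flow are all assumptions, and a real implementation would also need to watch the new node's DaemonSet pods and remove the taint once they are Ready (omitted here).

```python
# Hypothetical sketch of the taint-based workaround discussed above.
# Assumes the official `kubernetes` Python client; the taint key and
# helper names are illustrative, not from Airflow itself.
STARTUP_TAINT = {"key": "airflow/node-initializing",
                 "value": "true",
                 "effect": "NoSchedule"}

def with_startup_taint(taints):
    """Return a node's taint list with the startup taint appended (idempotent)."""
    taints = list(taints or [])
    if STARTUP_TAINT not in taints:
        taints.append(STARTUP_TAINT)
    return taints

def guard_new_nodes():
    """Taint newly added nodes so Airflow pods wait for system pods.

    Requires cluster access. A real version would also track when the
    node's DaemonSet pods become Ready and patch the taint away again.
    """
    from kubernetes import client, config, watch  # pip install kubernetes
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_node):
        if event["type"] != "ADDED":
            continue
        node = event["object"]
        # Preserve any taints the autoscaler already applied.
        existing = [t.to_dict() for t in (node.spec.taints or [])]
        v1.patch_node(node.metadata.name,
                      {"spec": {"taints": with_startup_taint(existing)}})
```

System pods (e.g. the DaemonSets mentioned above) would need a matching toleration in their spec so they can still start on the tainted node.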
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977405#comment-16977405 ] Daniel Imberman commented on AIRFLOW-6014: -- Hmmm, this is an interesting one. Do you have any thoughts on what the solution could be? Maybe there's some kind of taint we can put on the pods to prevent them being moved?
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977392#comment-16977392 ] afusr commented on AIRFLOW-6014: The following PR has been raised as a temporary fix for this scenario: [https://github.com/apache/airflow/pull/6606] It sets the state of the task to UP_FOR_RESCHEDULE, which in testing results in the pod being rescheduled without affecting the retry count for the task. This should be the case: if the pod is still in a Pending state and has been deleted, the task never had a chance to run, as it never transitioned to Running.
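The decision the PR adds can be illustrated with a simplified sketch. This is not the actual Airflow code: the real logic lives in the Kubernetes executor's pod-event watcher and carries more context (pod id, labels, resource version), and the state strings are inlined here so the example is self-contained. The point is the handling of a DELETED event for a Pending pod:

```python
# Simplified sketch of the event handling described above. The state
# strings mirror airflow.utils.state.State but are inlined so this
# example stands alone. Illustrative only.
RUNNING, FAILED, UP_FOR_RESCHEDULE = "running", "failed", "up_for_reschedule"

def process_status(pod_phase, event_type):
    """Map a pod watch event to a task outcome, or None to ignore it.

    Previously a DELETED event for a Pending pod fell through unhandled,
    leaving the task queued forever. Returning UP_FOR_RESCHEDULE lets the
    scheduler relaunch the pod without consuming a task retry.
    """
    if event_type == "DELETED" and pod_phase == "Pending":
        # Preempted before it ever ran: reschedule rather than fail.
        return UP_FOR_RESCHEDULE
    if pod_phase == "Failed":
        return FAILED
    if pod_phase == "Running":
        return RUNNING
    return None  # e.g. a Pending pod still waiting to start
```

Treating this case as UP_FOR_RESCHEDULE rather than FAILED matters because the task never actually executed, so it should not count against the task's retry limit.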
[ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977386#comment-16977386 ] ASF GitHub Bot commented on AIRFLOW-6014: - atrbgithub commented on pull request #6606: [AIRFLOW-6014] - handle pods which are preempted and deleted by kuber… URL: https://github.com/apache/airflow/pull/6606 …netes but not restarted Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW-6014) issues and references them in the PR title. ### Description - [x] Here are some details about my PR, including screenshots of any UI changes: This PR addresses the issue where a pod is preempted during the creation phase; because pods have ```restartPolicy: Never``` in their spec, the pod is never restarted and ends up as a queued task within Airflow until the scheduler is restarted. ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: Unsure if it is possible to simulate this scenario. ### Commits - [x] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [x] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release