[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-08-12 Thread Lindsay Portelli (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176404#comment-17176404
 ] 

Lindsay Portelli commented on AIRFLOW-6014:
---

I am facing the same issue as [~kiruthiga24]. I was able to mitigate the issue 
of the failed pods by increasing the number of retries, but cannot figure out 
how to best clear the tasks stuck in the queued state.
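
For reference, a minimal sketch of one way to clear tasks stuck in this state, 
assuming direct use of the Airflow 1.10 ORM from an environment with access to 
the metadata database (the DAG id below is a placeholder; this is a workaround 
sketch, not an official API):

```python
# Sketch only: push task instances stuck in QUEUED back to no state so the
# scheduler picks them up again on its next loop. Assumes Airflow 1.10.x.
from airflow.models import TaskInstance
from airflow.utils.db import provide_session
from airflow.utils.state import State


@provide_session
def reset_stuck_queued_tasks(dag_id, session=None):
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == dag_id,
                TaskInstance.state == State.QUEUED)
        .all()
    )
    for ti in stuck:
        ti.state = State.NONE  # cleared tasks are rescheduled, not failed
    session.commit()
    return len(stuck)


if __name__ == "__main__":
    print(reset_stuck_queued_tasks("my_dag"))  # "my_dag" is a placeholder
```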

> Kubernetes executor - handle preempted deleted pods - queued tasks
> --
>
> Key: AIRFLOW-6014
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: executor-kubernetes
>Affects Versions: 1.10.6
>Reporter: afusr
>Assignee: Daniel Imberman
>Priority: Minor
> Fix For: 1.10.10
>
> Attachments: image-2020-07-14-11-27-21-277.png, 
> image-2020-07-14-11-29-14-334.png
>
>
> We have encountered an issue whereby, when using the Kubernetes executor 
> with autoscaling, Airflow pods are preempted and Airflow never attempts to 
> rerun them. 
> This is partly a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense: if a pod fails while running a task, we don't want 
> Kubernetes to retry it, as retries should be controlled by Airflow. 
> What we believe happens is that when a new node is added by autoscaling, 
> Kubernetes schedules a number of Airflow pods onto the new node, as well as 
> any pods required by k8s/daemon sets. As these are higher priority, the 
> Airflow pods are preempted and deleted. You see messages such as:
>  
> Preempted by kube-system/ip-masq-agent-xz77q on node 
> gke-some--airflow--node-1ltl
>  
> Within the Kubernetes executor, these pods end up in a status of Pending, and 
> a deleted event is received but not handled. 
> The end result is that tasks remain in a queued state forever. 
>  
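
To make the quoted description concrete, a minimal sketch of a worker pod 
created with this restart policy, using the kubernetes Python client (the 
builder function, labels, and image are illustrative, not the executor's 
actual pod factory):

```python
# Illustrative only: a worker pod with restartPolicy: Never, so a pod that is
# preempted and deleted is never brought back by Kubernetes itself.
from kubernetes import client


def build_worker_pod(name, image, args):
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name,
                                     labels={"airflow-worker": "true"}),
        spec=client.V1PodSpec(
            restart_policy="Never",  # retries are left to Airflow, not k8s
            containers=[client.V1Container(name="base", image=image,
                                           args=args)],
        ),
    )


pod = build_worker_pod("example-task-pod", "apache/airflow:1.10.6",
                       ["airflow", "version"])
```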



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-07-14 Thread Kiruthiga (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157168#comment-17157168
 ] 

Kiruthiga commented on AIRFLOW-6014:


I am facing the same issue.

The task pods that are preempted by Kubernetes to accommodate critical system 
pods are marked as "queued" or "failed" in Airflow. I am concentrating on the 
queued tasks, as I am not sure why a task fails on preemption.

In my case, the Kubernetes scheduler marks the task (preempted pod) for the 
"up_for_reschedule" state, but this is not reflected in the Airflow 
database/webserver UI.

Attaching the screenshots for reference.

*Kubernetes Scheduler Log*

!image-2020-07-14-11-27-21-277.png!

*Airflow Webserver* - the task *sleep_for_1* is still in the queued state (the 
expected state is "up_for_reschedule")

!image-2020-07-14-11-29-14-334.png!

 

I have started debugging the Airflow code. The mentioned log (screenshot 1, 
from the Kubernetes scheduler) comes from the airflow/jobs/*scheduler_job*.py 
file, method *_process_executor_events*. I suspect the state 
*State.UP_FOR_RESCHEDULE* is not handled in this method.

 

Please correct me if my understanding is wrong, and help me fix this issue.
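
For illustration only, a much-simplified sketch of the kind of event handling 
being described here (this is not the actual _process_executor_events source; 
it only shows how an executor event for a non-terminal state could be dropped 
while the task instance stays queued):

```python
# Simplified illustration, not Airflow source: if only SUCCESS/FAILED events
# are acted on, a pod deleted while still pending leaves its task instance
# sitting in QUEUED with nothing to move it on.
from airflow.utils.state import State


def process_executor_events(event_buffer, task_instances_by_key, log):
    for key, executor_state in event_buffer.items():
        ti = task_instances_by_key.get(key)
        if ti is None:
            continue
        if executor_state in (State.SUCCESS, State.FAILED):
            log.info("Executor reported %s for %s", executor_state, key)
            # ... terminal states are handled here ...
        else:
            # A state such as UP_FOR_RESCHEDULE (pod deleted while pending)
            # falls through here, so the task remains queued until the
            # scheduler is restarted.
            log.debug("Ignoring executor state %s for %s", executor_state, key)
```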

 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064323#comment-17064323
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 17f0eb15ba1d2766eb673e1c846f3d278207cd0a in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=17f0eb1 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064142#comment-17064142
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 1ec4b7ad9c2d07188d0ed2b3882cdc434c625830 in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=1ec4b7a ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062519#comment-17062519
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 9de592b8048460cc84a84cbeb360356e2ac05b14 in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=9de592b ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062501#comment-17062501
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit f9177b0a55c67a9694699ff4e861186fa869201b in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=f9177b0 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062264#comment-17062264
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 5c2ebe2f87e08ff4cc8c2aba1c36c5411536b9d5 in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=5c2ebe2 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062221#comment-17062221
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 2ae99e145374655c87068bce48e91f07a6567242 in airflow's branch 
refs/heads/v1-10-test from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=2ae99e1 ]

[AIRFLOW-6014] Handle pods which are preempted & deleted by kubernetes but not 
restarted (#6606)

cherry-picked from 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38




[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061418#comment-17061418
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38 in airflow's branch 
refs/heads/master from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e626be ]

[AIRFLOW-6014] - handle pods which are preempted and deleted by kuber… (#6606)

* [AIRFLOW-6014] - handle pods which are preempted and deleted by kubernetes 
but not restarted



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061416#comment-17061416
 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-

potiuk commented on pull request #6606: [AIRFLOW-6014] - handle pods which are 
preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
 
 
   
 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061417#comment-17061417
 ] 

ASF subversion and git services commented on AIRFLOW-6014:
--

Commit 4e626be3c90d76fac7ffc3a6b5c6fed10753fd38 in airflow's branch 
refs/heads/master from atrbgithub
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e626be ]

[AIRFLOW-6014] - handle pods which are preempted and deleted by kuber… (#6606)

* [AIRFLOW-6014] - handle pods which are preempted and deleted by kubernetes 
but not restarted



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060912#comment-17060912
 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-

inytar commented on pull request #7611: [AIRFLOW-6014] reschedule deleted 
pending tasks
URL: https://github.com/apache/airflow/pull/7611
 
 
   
 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048694#comment-17048694
 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-

atrbgithub commented on pull request #6606: [AIRFLOW-6014] - handle pods which 
are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
 
 
   …netes but not restarted
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW-6014) issues and references 
them in the PR title. 
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR addresses the issue where a pod is preempted during the creation 
phase and, because pods have ```restartPolicy: Never``` in their spec, the pod 
is never restarted and ends up as a queued task within Airflow until the 
scheduler is restarted.
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Unsure if it is possible to simulate this scenario. 
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2020-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028246#comment-17028246
 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-

stale[bot] commented on pull request #6606: [AIRFLOW-6014] - handle pods which 
are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
 
 
   
 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2019-12-02 Thread afusr (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986672#comment-16986672
 ] 

afusr commented on AIRFLOW-6014:


[~dimberman] Do you think this can be merged? This fix has saved us a number 
of times now by catching pods which have failed to start and rescheduling 
them.



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2019-11-19 Thread afusr (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978151#comment-16978151
 ] 

afusr commented on AIRFLOW-6014:


I've created a PR which should catch these deleted pods and mark them as up 
for reschedule.

We have looked at taints; it seems the ones applied to the k8s node by the GKE 
autoscaler when the node spins up don't prevent Airflow pods from being 
scheduled on there before all system pods have started.

You could perhaps create some kind of watch process to look for newly created 
nodes, apply a taint, and wait for the system pods to start. But you would then 
have to ensure any system pods you want on there have a toleration added to 
their spec so they are able to start. Once the system pods are up you could 
then remove the taint and allow Airflow pods to be placed there.

It's interesting to consider why k8s creates a state where this can happen in 
the first place. My guess is that whilst the new node is starting, multiple 
Airflow tasks back up and are waiting to be scheduled. Once it is ready, the 
k8s scheduler selects a number of Airflow pods and, looking at the pod memory 
request value, decides they will all fit on the new node. Then perhaps it also 
tries to schedule any daemon sets on there; as these must be present and have 
a higher priority, they force a random Airflow pod to be preempted, which is 
then deleted from the node.

There is a similar issue described in this openshift bug report, particularly 
this comment [https://bugzilla.redhat.com/show_bug.cgi?id=1701046#c13]

The most straightforward approach, I think, is to just ensure that if a pod is 
pending and is then deleted, it is marked as up for reschedule, as the linked 
PR should do. Airflow then appears (from testing) to relaunch the pod without 
affecting the retry limit for the task.
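
To make the taint-based alternative described above concrete, a rough sketch 
with the kubernetes Python client of what such a watch process might do: taint 
a freshly added node, wait for the system pods (which would need a matching 
toleration) to come up, then lift the taint. The taint key and functions are 
illustrative; nothing like this ships with Airflow.

```python
# Rough sketch of the taint-based workaround discussed above; names are
# illustrative and a real implementation would merge with existing taints.
from kubernetes import client, config

TAINT_KEY = "node.bootstrap/airflow-hold"  # illustrative taint key


def hold_node(v1, node_name):
    # NoSchedule keeps pods without a matching toleration (e.g. Airflow
    # workers) off the node until the taint is removed.
    v1.patch_node(node_name, {"spec": {"taints": [
        {"key": TAINT_KEY, "value": "true", "effect": "NoSchedule"}]}})


def system_pods_ready(v1, node_name, namespace="kube-system"):
    pods = v1.list_namespaced_pod(
        namespace, field_selector="spec.nodeName=" + node_name)
    return all(p.status.phase == "Running" for p in pods.items)


def release_node(v1, node_name):
    # Sketch only: this clears all taints rather than removing just ours.
    v1.patch_node(node_name, {"spec": {"taints": []}})


if __name__ == "__main__":
    config.load_kube_config()
    api = client.CoreV1Api()
    # e.g. hold_node(api, node), poll system_pods_ready(api, node),
    # then release_node(api, node)
```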

 

 

 

 



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2019-11-19 Thread Daniel Imberman (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977405#comment-16977405
 ] 

Daniel Imberman commented on AIRFLOW-6014:
--

Hmmm this is an interesting one. Do you have any thoughts on what the 
solution could be? Maybe there's some kind of taint we can put on the pods to 
prevent them being moved?



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2019-11-19 Thread afusr (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977392#comment-16977392
 ] 

afusr commented on AIRFLOW-6014:


The following PR has been raised as a temporary fix for this scenario: 
[https://github.com/apache/airflow/pull/6606] 

It sets the state of the task to UP_FOR_RESCHEDULE, which in testing results 
in the pod being rescheduled without affecting the retry count for the task. 
This should be the case, as the task has not yet had a chance to run: it was 
still in a Pending state when it was deleted and never transitioned to 
Running.
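
A much-simplified sketch of the behaviour described above (not the actual diff 
from PR #6606): when the pod watcher sees a DELETED event for a pod that is 
still Pending, report the task back as up for reschedule instead of dropping 
the event. The report_state callback here stands in for the executor's result 
queue.

```python
# Simplified illustration of the behaviour described above, not the PR itself:
# a pod deleted while still Pending never ran, so hand the task back to the
# scheduler instead of failing it or leaving it queued forever.
from kubernetes import client, config, watch
from airflow.utils.state import State


def watch_task_pods(namespace, report_state):
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    stream = watch.Watch().stream(v1.list_namespaced_pod, namespace,
                                  label_selector="airflow-worker")
    for event in stream:
        pod = event["object"]
        phase = pod.status.phase
        if event["type"] == "DELETED" and phase == "Pending":
            # Preempted/deleted before it ever started running: reschedule
            # without consuming one of the task's retries.
            report_state(pod.metadata.name, State.UP_FOR_RESCHEDULE)
        elif phase == "Failed":
            report_state(pod.metadata.name, State.FAILED)
        elif phase == "Succeeded":
            report_state(pod.metadata.name, State.SUCCESS)
```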



[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977386#comment-16977386
 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-

atrbgithub commented on pull request #6606: [AIRFLOW-6014] - handle pods which 
are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
 
 
   …netes but not restarted
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW-6014) issues and references 
them in the PR title. 
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR addresses the issue where a pod is preempted during the creation 
phase and, because pods have ```restartPolicy: Never``` in their spec, the pod 
is never restarted and ends up as a queued task within Airflow until the 
scheduler is restarted.
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Unsure if it is possible to simulate this scenario. 
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
 
