GitHub user namelessjon edited a discussion: Only retry KubernetesPodOperator if the pod scheduling fails
Hi, I have a number of jobs using the KubernetesPodOperator (via EKS, though that doesn't seem relevant to this discussion). Pod scheduling in our cluster is a little unreliable, so I set retries on the operator in Airflow to ride out transient failures. So far, so good.

However, independently of this, the jobs running in the pods sometimes fail (generally due to bad inputs somewhere upstream, but whatever the reason, these failures are very unlikely to resolve on a retry). Because of the retry policy, they are retried anyway, which delays error notifications and wastes effort re-running the task.

How can I configure things so that a pod that fails to schedule is retried, but if the pod runs to completion, a non-zero exit code is treated as if an `AirflowFailException` had been raised? Suggestions gratefully accepted!

EDIT: from some investigation, it seems the handling of the exit code all happens in the [cleanup](https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/cncf/kubernetes/operators/pod.py#L851) method, with no real option to interfere. Is that correct?

GitHub link: https://github.com/apache/airflow/discussions/44390
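The behaviour asked for above can be modelled in plain Python. This is only an illustrative sketch, not the Airflow or KubernetesPodOperator API: `RetryableError`, `FatalError`, and `run_with_retries` are hypothetical stand-ins showing how a retry loop can re-run transient scheduling failures while letting a "container exited non-zero" error propagate immediately, which is the effect `AirflowFailException` has on an Airflow task.

```python
# Illustrative model only -- these names are hypothetical, not Airflow APIs.

class RetryableError(Exception):
    """Transient failure, e.g. the pod could not be scheduled."""

class FatalError(Exception):
    """Permanent failure, e.g. the container ran and exited non-zero.
    Plays the role AirflowFailException plays in Airflow: no retries."""

def run_with_retries(task, retries=3):
    """Call task(attempt); retry RetryableError up to `retries` times.

    FatalError is deliberately not caught, so it fails the whole run
    on the first occurrence instead of burning through retries.
    """
    for attempt in range(retries + 1):
        try:
            return task(attempt)
        except RetryableError:
            if attempt == retries:
                raise  # out of retries; give up

# Example: scheduling fails twice, then the pod runs and succeeds.
calls = []

def flaky_schedule(attempt):
    calls.append(attempt)
    if attempt < 2:
        raise RetryableError("pod unschedulable")
    return "done"

assert run_with_retries(flaky_schedule) == "done"
assert calls == [0, 1, 2]
```

In Airflow terms, the split corresponds to raising `AirflowFailException` (fail immediately, skip remaining retries) for non-zero exit codes while letting ordinary exceptions consume the task's `retries`; whether the operator's `cleanup` method can be intercepted to do this is exactly the open question in the post.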
