GitHub user namelessjon edited a discussion: Only retry KubernetesPodOperator if the pod scheduling fails
Hi, I have a number of jobs using the KubernetesPodOperator (via EKS, though that doesn't seem relevant to this discussion). Pod scheduling in our cluster is a little unreliable, so I set retries on the operator in Airflow to ride out transient failures. So far, so good.

However, independently of this, the jobs running in the pods sometimes fail (generally due to bad inputs somewhere upstream, but whatever the reason, these failures are very unlikely to resolve on a retry). Because of the retry policy, they are retried anyway, which delays error notifications and wastes effort re-running the task.

How can I configure things so that a pod that fails to schedule is retried, but if the pod runs to completion, a non-zero exit code is treated as if an `AirflowFailException` had been raised? Suggestions gratefully accepted!

EDIT: from some investigation, it seems the handling of the exit code all happens in the [cleanup](https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/cncf/kubernetes/operators/pod.py#L851) method, with no real option to interfere. Is that correct?

GitHub link: https://github.com/apache/airflow/discussions/44390
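The behaviour asked for above can be modelled in plain Python. This is only an illustrative sketch, not the Airflow or KubernetesPodOperator API: `RetryableError`, `FatalError`, and `run_with_retries` are hypothetical stand-ins showing how a retry loop can re-run transient scheduling failures while letting a "container exited non-zero" error propagate immediately, which is the effect `AirflowFailException` has on an Airflow task.

```python
# Illustrative model only -- these names are hypothetical, not Airflow APIs.

class RetryableError(Exception):
    """Transient failure, e.g. the pod could not be scheduled."""

class FatalError(Exception):
    """Permanent failure, e.g. the container ran and exited non-zero.
    Plays the role AirflowFailException plays in Airflow: no retries."""

def run_with_retries(task, retries=3):
    """Call task(attempt); retry RetryableError up to `retries` times.

    FatalError is deliberately not caught, so it fails the whole run
    on the first occurrence instead of burning through retries.
    """
    for attempt in range(retries + 1):
        try:
            return task(attempt)
        except RetryableError:
            if attempt == retries:
                raise  # out of retries; give up

# Example: scheduling fails twice, then the pod runs and succeeds.
calls = []

def flaky_schedule(attempt):
    calls.append(attempt)
    if attempt < 2:
        raise RetryableError("pod unschedulable")
    return "done"

assert run_with_retries(flaky_schedule) == "done"
assert calls == [0, 1, 2]
```

In Airflow terms, the split corresponds to raising `AirflowFailException` (fail immediately, skip remaining retries) for non-zero exit codes while letting ordinary exceptions consume the task's `retries`; whether the operator's `cleanup` method can be intercepted to do this is exactly the open question in the post.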
