Upon further internal discussion, we suspect the task cloning may be happening because the Postgres DB is getting into a corrupted state, but that's still unclear. If the consensus is that we *shouldn't* be seeing this behavior even as-is, we'll push more on that angle.
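In case it helps the discussion, here is a rough sketch of the kind of idempotency guard we've been considering for our custom operator. It is only a sketch under assumptions: the class name DeduplicatedK8sJobOperator and the helper _job_name are hypothetical, and it assumes the official `kubernetes` Python client with in-cluster config. The idea is to derive a deterministic Job name from the DAG/task/execution date, so that if Airflow comes back up and re-fires task X, the operator finds the already-running Job instead of spinning up a second pod:

import logging
import re

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from kubernetes import client, config
from kubernetes.client.rest import ApiException


class DeduplicatedK8sJobOperator(BaseOperator):
    """Launch a Kubernetes Job, but adopt an identically named Job if one
    already exists (e.g. because Airflow crashed and re-ran the task)."""

    @apply_defaults
    def __init__(self, job_manifest, namespace='default', *args, **kwargs):
        super(DeduplicatedK8sJobOperator, self).__init__(*args, **kwargs)
        self.job_manifest = job_manifest  # assumed: plain dict with a 'metadata' key
        self.namespace = namespace

    def _job_name(self, context):
        # Deterministic name per (dag, task, execution_date), so the re-run
        # after a crash maps onto the same Job object. Kubernetes names must
        # be lowercase DNS-1123, hence the sanitization.
        raw = '{}-{}-{}'.format(
            self.dag_id, self.task_id,
            context['execution_date'].strftime('%Y%m%d%H%M%S'))
        return re.sub(r'[^a-z0-9-]', '-', raw.lower())

    def execute(self, context):
        config.load_incluster_config()  # we run Airflow inside the cluster
        batch = client.BatchV1Api()
        name = self._job_name(context)
        try:
            batch.read_namespaced_job(name=name, namespace=self.namespace)
            logging.info('Job %s already exists; not launching a duplicate.',
                         name)
            return  # a real version would watch/poll the existing Job here
        except ApiException as e:
            if e.status != 404:
                raise  # genuine API error, not just "Job not found"
        self.job_manifest['metadata']['name'] = name
        batch.create_namespaced_job(namespace=self.namespace,
                                    body=self.job_manifest)

(This obviously only covers the "don't launch a duplicate" half; we'd still need to attach to and monitor the existing Job so the task instance reflects its real state.)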
On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <ch...@fathomhealth.co> wrote:
> Hi all,
>
> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
> something as simple as the underlying infrastructure going down).
>
> Currently, we run everything on Kubernetes (including Airflow), so the
> Airflow pod crashes will generally be detected, and then they will restart.
>
> However, if we have, e.g., a DAG that is running task X when it crashes,
> when Airflow comes back up, it apparently sees task X didn't complete, so
> it restarts the task (which, in this case, means it spins up an entirely
> new instance/pod). Thus, both runs "X_1" and "X_2" are fired off
> simultaneously.
>
> Is there any (out of the box) way to better connect up state between tasks
> and Airflow to prevent this?
>
> (For additional context, we currently execute Kubernetes jobs via a custom
> operator that basically layers on top of BashOperator...perhaps the new
> Kubernetes operator will help address this?)
>
> Thank you in advance for any thoughts,
>
> Chris