Hmm, perhaps we've just had a couple of bad/unlucky runs, but in general the underlying task-kill process doesn't really seem to work, from what we've seen. I would guess this is related to https://issues.apache.org/jira/browse/AIRFLOW-1623.
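
For concreteness, this is roughly the kind of check we understood should fire on each heartbeat -- a minimal sketch only, not Airflow's actual implementation (the helper name and the session handling below are ours):

import os
import signal

from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

def terminate_if_externally_killed(ti):
    """Hypothetical helper: reload this task instance's state from the
    metadata DB and kill the worker process if it is no longer RUNNING."""
    session = settings.Session()
    try:
        fresh = (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == ti.dag_id,
                TaskInstance.task_id == ti.task_id,
                TaskInstance.execution_date == ti.execution_date,
            )
            .one_or_none()
        )
        if fresh is not None and fresh.state != State.RUNNING:
            # State was cleared/failed externally; stop doing work.
            os.kill(os.getpid(), signal.SIGTERM)
    finally:
        session.close()
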
On Sun, Dec 17, 2017 at 12:22 PM, Bolke de Bruin <[email protected]> wrote:

> Shorter heartbeats; you might still have some tasks being scheduled nevertheless due to the time window. However, if a task detects it is running somewhere else, it should also terminate itself.
>
> [scheduler]
> # Task instances listen for external kill signal (when you clear tasks
> # from the CLI or the UI), this defines the frequency at which they should
> # listen (in seconds).
> job_heartbeat_sec = 5
>
> Bolke.
>
>> On 17 Dec 2017, at 20:59, Christopher Bockman <[email protected]> wrote:
>>
>>> P.S. I am assuming that you are talking about your scheduler going down, not workers
>>
>> Correct (and, in some unfortunate scenarios, everything else...)
>>
>>> Normally a task will detect (on the heartbeat interval) whether its state was changed externally and will terminate itself.
>>
>> Hmm, that would be an acceptable solution, but this doesn't (automatically, in our current configuration) occur. How can we encourage this behavior to happen?
>>
>> On Sun, Dec 17, 2017 at 11:47 AM, Bolke de Bruin <[email protected]> wrote:
>>
>>> Quite important to know is that Airflow's executors do not keep state after a restart. This particularly affects distributed executors (Celery, Dask), as the workers are independent from the scheduler. Thus at restart we reset all the tasks in the queued state that the executor does not know about, which means all of them at the moment. Due to the distributed nature of the executors, tasks can still be running. Normally a task will detect (on the heartbeat interval) whether its state was changed externally and will terminate itself.
>>>
>>> I did some work a few months ago to make the executor keep state over restarts, but never got around to finishing it.
>>>
>>> So at the moment, to prevent requeuing, you need to keep the Airflow scheduler from going down (as much as possible).
>>>
>>> Bolke.
>>>
>>> P.S. I am assuming that you are talking about your scheduler going down, not workers
>>>
>>>> On 17 Dec 2017, at 20:07, Christopher Bockman <[email protected]> wrote:
>>>>
>>>> Upon further internal discussion, we might be seeing the task cloning because the Postgres DB is getting into a corrupted state...but it's unclear. If the consensus is that we *shouldn't* be seeing this behavior, even as-is, we'll push more on that angle.
>>>>
>>>> On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe something as simple as the underlying infrastructure going down).
>>>>>
>>>>> Currently, we run everything on Kubernetes (including Airflow), so Airflow pod crashes will generally be detected, and the pods will then restart.
>>>>>
>>>>> However, if we have, e.g., a DAG that is running task X when Airflow crashes, then when Airflow comes back up it apparently sees that task X didn't complete, so it restarts the task (which, in this case, means it spins up an entirely new instance/pod). Thus, runs "X_1" and "X_2" are both fired off simultaneously.
>>>>>
>>>>> Is there any (out-of-the-box) way to better connect up state between tasks and Airflow to prevent this?
>>>>>
>>>>> (For additional context, we currently execute Kubernetes jobs via a custom operator that basically layers on top of BashOperator...perhaps the new Kubernetes operator will help address this?)
>>>>>
>>>>> Thank you in advance for any thoughts,
>>>>>
>>>>> Chris
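
(In case it helps frame the Kubernetes-operator question quoted above, the kind of usage we'd be moving to looks roughly like the following -- a sketch only, assuming the contrib KubernetesPodOperator interface; the image, names, and namespace are placeholders.)

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Sketch of replacing our custom BashOperator-based operator with the
# contrib KubernetesPodOperator; all values below are placeholders.
run_task_x = KubernetesPodOperator(
    task_id="task_x",
    name="task-x",                  # name of the pod that gets launched
    namespace="default",
    image="python:3.6",
    cmds=["python", "-c"],
    arguments=["print('task X')"],
    get_logs=True,
    dag=dag,                        # assumes an existing `dag` object
)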
