Hmm, perhaps we've just had a couple of bad/unlucky runs, but in general the underlying task-kill process doesn't really seem to work, from what we've seen. I would guess this is related to https://issues.apache.org/jira/browse/AIRFLOW-1623.
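
For concreteness, this is roughly the kind of check we understood should fire on each heartbeat -- a minimal sketch only, not Airflow's actual implementation (the helper name and the session handling below are ours):

import os
import signal

from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

def terminate_if_externally_killed(ti):
    """Hypothetical helper: reload this task instance's state from the
    metadata DB and kill the worker process if it is no longer RUNNING."""
    session = settings.Session()
    try:
        fresh = (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == ti.dag_id,
                TaskInstance.task_id == ti.task_id,
                TaskInstance.execution_date == ti.execution_date,
            )
            .one_or_none()
        )
        if fresh is not None and fresh.state != State.RUNNING:
            # State was cleared/failed externally; stop doing work.
            os.kill(os.getpid(), signal.SIGTERM)
    finally:
        session.close()
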
On Sun, Dec 17, 2017 at 12:22 PM, Bolke de Bruin <[email protected]> wrote:

> Shorter heartbeats; you might still have some tasks being scheduled nevertheless due to the time window. However, if a task detects it is running somewhere else, it should also terminate itself.
>
> [scheduler]
> # Task instances listen for external kill signal (when you clear tasks
> # from the CLI or the UI), this defines the frequency at which they should
> # listen (in seconds).
> job_heartbeat_sec = 5
>
> Bolke.
>
>> On 17 Dec 2017, at 20:59, Christopher Bockman <[email protected]> wrote:
>>
>>> P.S. I am assuming that you are talking about your scheduler going down, not workers
>>
>> Correct (and, in some unfortunate scenarios, everything else...)
>>
>>> Normally a task will detect (on the heartbeat interval) whether its state was changed externally and will terminate itself.
>>
>> Hmm, that would be an acceptable solution, but this doesn't (automatically, in our current configuration) occur. How can we encourage this behavior to happen?
>>
>> On Sun, Dec 17, 2017 at 11:47 AM, Bolke de Bruin <[email protected]> wrote:
>>
>>> Quite important to know is that Airflow's executors do not keep state after a restart. This particularly affects distributed executors (Celery, Dask), as the workers are independent from the scheduler. Thus at restart we reset all the tasks in the queued state that the executor does not know about, which means all of them at the moment. Due to the distributed nature of the executors, tasks can still be running. Normally a task will detect (on the heartbeat interval) whether its state was changed externally and will terminate itself.
>>>
>>> I did some work a few months ago to make the executor keep state over restarts, but never got around to finishing it.
>>>
>>> So at the moment, to prevent requeuing, you need to keep the Airflow scheduler from going down (as much as possible).
>>>
>>> Bolke.
>>>
>>> P.S. I am assuming that you are talking about your scheduler going down, not workers
>>>
>>>> On 17 Dec 2017, at 20:07, Christopher Bockman <[email protected]> wrote:
>>>>
>>>> Upon further internal discussion, we might be seeing the task cloning because the Postgres DB is getting into a corrupted state...but it's unclear. If the consensus is that we *shouldn't* be seeing this behavior, even as-is, we'll push more on that angle.
>>>>
>>>> On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe something as simple as the underlying infrastructure going down).
>>>>>
>>>>> Currently, we run everything on Kubernetes (including Airflow), so Airflow pod crashes will generally be detected, and the pods will then restart.
>>>>>
>>>>> However, if we have, e.g., a DAG that is running task X when Airflow crashes, then when Airflow comes back up it apparently sees that task X didn't complete, so it restarts the task (which, in this case, means it spins up an entirely new instance/pod). Thus, runs "X_1" and "X_2" are both fired off simultaneously.
>>>>>
>>>>> Is there any (out-of-the-box) way to better connect up state between tasks and Airflow to prevent this?
>>>>>
>>>>> (For additional context, we currently execute Kubernetes jobs via a custom operator that basically layers on top of BashOperator...perhaps the new Kubernetes operator will help address this?)
>>>>>
>>>>> Thank you in advance for any thoughts,
>>>>>
>>>>> Chris
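
(In case it helps frame the Kubernetes-operator question quoted above, the kind of usage we'd be moving to looks roughly like the following -- a sketch only, assuming the contrib KubernetesPodOperator interface; the image, names, and namespace are placeholders.)

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Sketch of replacing our custom BashOperator-based operator with the
# contrib KubernetesPodOperator; all values below are placeholders.
run_task_x = KubernetesPodOperator(
    task_id="task_x",
    name="task-x",                  # name of the pod that gets launched
    namespace="default",
    image="python:3.6",
    cmds=["python", "-c"],
    arguments=["print('task X')"],
    get_logs=True,
    dag=dag,                        # assumes an existing `dag` object
)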
