Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hmm, perhaps we've just had a couple of bad/unlucky runs, but in general the
underlying task-kill process doesn't really seem to work, from what we've
seen.  I would guess this is related to
https://issues.apache.org/jira/browse/AIRFLOW-1623.





Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Bolke de Bruin
Shorter heartbeats. You might still have some tasks being scheduled
nevertheless due to the time window; however, if a task detects it is
running somewhere else, it should also terminate itself.

[scheduler]
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5

Bolke.





Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
> P.S. I am assuming that you are talking about your scheduler going down,
> not workers

Correct (and, in some unfortunate scenarios, everything else...)

> Normally a task will detect (on the heartbeat interval) whether its state
> was changed externally and will terminate itself.

Hmm, that would be an acceptable solution, but this doesn't (automatically,
in our current configuration) occur.  How can we encourage this behavior to
happen?




Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Bolke de Bruin
Quite important to know is that Airflow’s executors do not keep state after
a restart. This particularly affects distributed executors (celery, dask), as
the workers are independent from the scheduler. Thus at restart we reset all
the tasks in the queued state that the executor does not know about, which at
the moment means all of them. Due to the distributed nature of the executors,
tasks can still be running. Normally a task will detect (on the heartbeat
interval) whether its state was changed externally and will terminate itself.
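
Conceptually, that self-termination check amounts to something like the
sketch below (a minimal illustration only; the helper name, the "running"
state string, and the polling loop are assumptions, not Airflow's actual
implementation):

import sys
import time


def get_state_from_metadata_db(dag_id, task_id, execution_date):
    # Hypothetical helper: read this task instance's current state from the
    # Airflow metadata database (e.g. via SQLAlchemy). Stubbed for illustration.
    return "running"


def heartbeat_until_changed_externally(dag_id, task_id, execution_date,
                                       heartbeat_sec=5):
    # On every heartbeat, compare the state recorded in the metadata DB with
    # the state this worker believes it has. If a scheduler restart or a user
    # clearing the task changed it externally, terminate this process.
    while True:
        db_state = get_state_from_metadata_db(dag_id, task_id, execution_date)
        if db_state != "running":
            print("State changed externally to %r; terminating." % db_state)
            sys.exit(1)
        time.sleep(heartbeat_sec)

Because the check only happens on the heartbeat interval, shortening
job_heartbeat_sec narrows the window during which a duplicate can keep
running.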

I did some work a few months ago to make the executor keep state over
restarts, but never got around to finishing it.

So at the moment, to prevent requeuing, you need to keep the Airflow
scheduler from going down (as much as possible).

Bolke.

P.S. I am assuming that you are talking about your scheduler going down, not 
workers




Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Upon further internal discussion, we might be seeing the task cloning
because the Postgres DB is getting into a corrupted state...but that's
unclear.  If the consensus is that we *shouldn't* be seeing this behavior,
even as-is, we'll push more on that angle.



how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hi all,

We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
something as simple as the underlying infrastructure going down).

Currently, we run everything on Kubernetes (including Airflow), so Airflow
pod crashes will generally be detected, and the pods will then restart.

However, if we have, e.g., a DAG that is running task X when it crashes,
then when Airflow comes back up, it apparently sees that task X didn't
complete, so it restarts the task (which, in this case, means it spins up an
entirely new instance/pod).  Thus, both runs "X_1" and "X_2" are fired off
simultaneously.

Is there any (out of the box) way to better connect up state between tasks
and Airflow to prevent this?

(For additional context, we currently execute Kubernetes jobs via a custom
operator that basically layers on top of BashOperator...perhaps the new
Kubernetes operator will help address this?)
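
(For illustration only: one way a custom operator could guard against the
duplicate launch described above is to derive a deterministic Kubernetes Job
name from the task run, and check whether that Job already exists before
creating it. The naming scheme, manifest handling, and kubectl-based check
below are assumptions for the sketch, not part of any existing operator.)

import os
import subprocess


def job_name_for(dag_id, task_id, execution_date):
    # Deterministic name: the same task run always maps to the same Job, so a
    # task relaunched after a crash can find the Job it started earlier.
    raw = "%s-%s-%s" % (dag_id, task_id, execution_date.strftime("%Y%m%d%H%M%S"))
    return raw.lower().replace("_", "-")[:63]


def job_exists(name, namespace="default"):
    # `kubectl get job <name>` exits non-zero if the Job is absent.
    with open(os.devnull, "wb") as devnull:
        return subprocess.call(
            ["kubectl", "get", "job", name, "-n", namespace],
            stdout=devnull, stderr=devnull,
        ) == 0


def launch_job_if_absent(name, manifest_path, namespace="default"):
    if job_exists(name, namespace):
        print("Job %s already exists; not launching a duplicate." % name)
        return
    subprocess.check_call(["kubectl", "apply", "-f", manifest_path, "-n", namespace])

Whether the new Kubernetes operator handles this case is exactly the open
question above; a deterministic name is just a cheap safety net in the
meantime.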

Thank you in advance for any thoughts,

Chris


Re: [VOTE] Airflow 1.9.0rc8

2017-12-17 Thread Bolke de Bruin
This is just a matter of setting the tag in the repo, right?

We should remove that check, or at least make it not fail. It is ridiculous.
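
For reference, the kind of tag/version sanity check being debated could be
reduced to something like the sketch below, which warns instead of failing
(illustrative only, under assumptions -- this is not Airflow's actual
setup.py/gitpython logic, and the version string is a placeholder):

import subprocess

EXPECTED_VERSION = "1.9.0"  # placeholder value for the sketch


def tag_on_head():
    # Return the git tag pointing at HEAD, or "" if HEAD is not tagged.
    try:
        out = subprocess.check_output(["git", "describe", "--exact-match", "--tags"])
        return out.decode("utf-8").strip()
    except subprocess.CalledProcessError:
        return ""


def warn_if_tag_mismatch(version=EXPECTED_VERSION):
    tag = tag_on_head()
    if tag and tag.lstrip("v") != version:
        # Warn rather than hard-fail, so a mismatch does not break the build
        # or a pip install from a source checkout.
        print("WARNING: git tag %r does not match package version %r"
              % (tag, version))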

B.

Sent from my iPad

> On 17 Dec 2017, at 07:32, Joy Gao wrote:
>
> Ahh, tested the build on a fresh virtualenv, which succeeded *the first
> time* given gitpython was not installed and it skipped the assertion check.
> Build fails on re-installs :(
>
>> On Fri, Dec 15, 2017 at 6:01 PM, Feng Lu wrote:
>>
>> +0.5 (non-binding)
>>
>> Looks like the version (1.9.0) and tag (1.9.0rc8) are mismatched,
>> which will cause the installation (pip install or python setup) to error
>> out and fail.
>> nit: mind also updating the release log
>> "https://github.com/apache/incubator-airflow/blob/1.9.0rc8/CHANGELOG.txt"
>>
>>
>> On Fri, Dec 15, 2017 at 3:21 PM, Driesprong, Fokko wrote:
>>
>>> +1 binding
>>>
>>> On Fri, 15 Dec 2017 at 23:39, Bolke de Bruin wrote:
>>>
>>>> +1, binding
>>>>
>>>> Checked sigs, version, source is there (did not check build), bin is
>>>> there.
>>>>
>>>> Bolke
>>>>
>>>> Sent from my iPad
>>>>
>>>> On 15 Dec 2017, at 23:31, Joy Gao wrote:
>>>>
>>>>> +1, binding
>>>>>
>>>>> Thank you Chris!
>>>>>
>>>>> On Fri, Dec 15, 2017 at 2:30 PM, Chris Riccomini <criccom...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> (Last time, I hope)^2
>>>>>>
>>>>>> I have cut Airflow 1.9.0 RC8. This email is calling a vote on the
>>>>>> release, which will last for 72 hours. Consider this my (binding) +1.
>>>>>>
>>>>>> Airflow 1.9.0 RC8 is available at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.9.0rc8/
>>>>>>
>>>>>> apache-airflow-1.9.0rc8+incubating-source.tar.gz is a source release
>>>>>> that comes with INSTALL instructions.
>>>>>> apache-airflow-1.9.0rc8+incubating-bin.tar.gz is the binary Python
>>>>>> "sdist" release.
>>>>>>
>>>>>> Public keys are available at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/release/incubator/airflow/
>>>>>>
>>>>>> The release contains no new JIRAs. Just a version fix.
>>>>>>
>>>>>> I also had to change the version number to exclude the `rc6` string as
>>>>>> well as the "+incubating" string, so it's now simply 1.9.0. This will
>>>>>> allow us to rename the artifact without modifying the artifact checksums
>>>>>> when we actually release.
>>>>>>
>>>>>> See JIRAs that were in 1.9.0RC7 and before (see previous VOTE email for
>>>>>> full list).
>>>>>>
>>>>>> Cheers,
>>>>>> Chris