Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Christopher Bockman
+1 to being able to disable--we have authentication in place, but use a
separate solution that (probably?) Airflow won't realize is enabled, so
having a continuous giant warning banner would be rather unfortunate.

On Tue, Jun 5, 2018 at 2:05 PM, Alek Storm  wrote:

> This is a great idea, but we'd appreciate a setting that disables the
> banner even if those conditions aren't met - our instance is deployed
> without authentication, but is only accessible via our intranet.
>
> Alek
>
>
> On Tue, Jun 5, 2018, 3:35 PM James Meickle 
> wrote:
>
> > I think that a banner notification would be a fair penalty if you access
> > Airflow without authentication, or have API authentication turned off, or
> > are accessing via http:// with a non-localhost `Host:`. (Are there any
> > other circumstances to think of?)
> >
> > I would also suggest serving a default robots.txt to mitigate accidental
> > indexing of public instances (as most public instances will be accidentally
> > public, statistically speaking). If you truly want your Airflow instance
> > public and indexed, you should have to go out of your way to permit that.
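> >
> > As a sketch (this is just the standard robots.txt convention, not something
> > Airflow serves today), a deny-all file is two lines:
> >
> > User-agent: *
> > Disallow: /
> >
> > Well-behaved crawlers will then skip the site entirely; it is of course no
> > substitute for authentication.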
> >
> > On Tue, Jun 5, 2018 at 1:51 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > What about a clear alert on the UI showing when auth is off? Perhaps a
> > > large red triangle-exclamation icon on the navbar with a tooltip
> > > "Authentication is off; this Airflow instance is not secure." and clicking
> > > it takes you to the docs' security page.
> > >
> > > Well and then of course people should make sure their infra isn't open to
> > > the Internet. We really shouldn't have to tell people to keep their
> > > infrastructure behind a firewall. In most environments you have to do quite
> > > a bit of work to open any resource up to the Internet (SSL certs, special
> > > security groups for load balancers/proxies, ...). Now I'm curious to
> > > understand how UMG managed to do this by mistake...
> > >
> > > Also a quick reminder to use the Connection abstraction to store secrets,
> > > ideally using the environment variable feature.
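> > >
> > > As an illustration (the connection id and credentials here are invented),
> > > a Connection can be supplied entirely through an environment variable
> > > holding a URI, rather than being stored in the metadata DB:
> > >
> > > export AIRFLOW_CONN_MY_POSTGRES='postgresql://user:secret@db.internal:5432/analytics'
> > >
> > > Tasks can then reference it with conn_id='my_postgres'.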
> > >
> > > Max
> > >
> > > On Tue, Jun 5, 2018 at 10:02 AM Taylor Edmiston 
> > > wrote:
> > >
> > > > One of our engineers wrote a blog post about the UMG mistakes as well.
> > > >
> > > > https://www.astronomer.io/blog/universal-music-group-airflow-leak/
> > > >
> > > > I know that best practices are well known here, but I second James'
> > > > suggestion that we add some docs, code, or config so that the framework
> > > > optimizes for being (nearly) production-ready by default and not just easy
> > > > to start with for local dev.  Admittedly this takes some work to not add
> > > > friction to the local onboarding experience.
> > > >
> > > > Do most people keep separate airflow.cfg files per environment like what's
> > > > considered the best practice in the Django world?  e.g.
> > > > https://stackoverflow.com/q/10664244/149428
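> > > >
> > > > One alternative worth noting (a sketch; the values below are only
> > > > examples): keep a single base airflow.cfg and override per environment
> > > > with AIRFLOW__{SECTION}__{KEY} environment variables, e.g.
> > > >
> > > > export AIRFLOW__CORE__LOAD_EXAMPLES=False
> > > > export AIRFLOW__WEBSERVER__AUTHENTICATE=True
> > > >
> > > > set by whatever provisions each environment.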
> > > >
> > > > Taylor
> > > >
> > > > *Taylor Edmiston*
> > > > Blog | CV | LinkedIn | AngelList | Stack Overflow
> > > >
> > > >
> > > > On Tue, Jun 5, 2018 at 9:57 AM, James Meickle <jmeic...@quantopian.com>
> > > > wrote:
> > > >
> > > > > Bumping this one because now Airflow is in the news over it...
> > > > >
> > > > > https://www.bleepingcomputer.com/news/security/contractor-exposes-credentials-for-universal-music-groups-it-infrastructure/
> > > > >
> > > > > On Fri, Mar 23, 2018 at 9:33 AM, James Meickle <jmeic...@quantopian.com>
> > > > > wrote:
> > > > >
> > > > > > While Googling something Airflow-related a few weeks ago, I noticed that
> > > > > > someone's Airflow dashboard had been indexed by Google and was accessible
> > > > > > to the outside world without authentication. A little more Googling
> > > > > > revealed a handful of other indexed instances in various states of
> > > > > > security. I did my best to contact the operators, and waited for responses
> > > > > > before posting this.
> > > > > >
> > > > > > Airflow is not a secure project by default
> > > > > > (https://issues.apache.org/jira/browse/AIRFLOW-2047), and you can do all
> > > > > > sorts of mean things to an instance that hasn't been intentionally locked
> > > > > > down. (And even then, you shouldn't rely exclusively on your app's
> > > > > > authentication for providing security.)
> > > > > >
> > > > > > Having "internal" dashboards/data sources/executors exposed to the web
> > > > > > is dangerous, since old versions can stick around for a very long time,
> 

Re: How to wait for external process

2018-05-28 Thread Christopher Bockman
Haven't done this, but we'll have a similar need in the future, so have
investigated a little.

What about a design pattern something like this:

1) When jobs are done (ready for further processing) they publish those
details to a queue (such as GC Pub/Sub or any other sort of queue)

2) A single "listener" DAG sits and periodically checks that queue.  If it
finds anything on it, it triggers (via DAG trigger) all of the DAGs which
are on the queue.*

* = if your triggering volume is too high, firing them all at once may cause
Airflow issues; this could presumably be solved via custom rate-limiting on
these triggers

3) The listener DAG resets itself (triggers itself)
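
A very rough, untested sketch of steps 2-3 (the dag_id "queue_listener" and
the pull_ready_jobs() helper are made up; swap in your real queue client):

from datetime import datetime
import subprocess

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_ready_jobs():
    # Placeholder: drain the queue (e.g. GC Pub/Sub) and return the dag_ids
    # that are ready to be triggered.
    return []


def trigger_ready_dags():
    for dag_id in pull_ready_jobs():
        # Shelling out to the CLI keeps the sketch independent of Airflow
        # internals; TriggerDagRunOperator also works when the target dag_id
        # is fixed.
        subprocess.check_call(["airflow", "trigger_dag", dag_id])


def retrigger_listener():
    # Step 3: the listener resets itself. Add a sleep (or a schedule) here
    # if the queue should only be checked every few minutes.
    subprocess.check_call(["airflow", "trigger_dag", "queue_listener"])


dag = DAG(
    dag_id="queue_listener",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,  # only runs when (re-)triggered
)

check_queue = PythonOperator(
    task_id="check_queue_and_trigger",
    python_callable=trigger_ready_dags,
    dag=dag,
)

reset = PythonOperator(
    task_id="retrigger_listener",
    python_callable=retrigger_listener,
    dag=dag,
)

check_queue >> reset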


On Mon, May 28, 2018 at 7:17 AM, Driesprong, Fokko 
wrote:

> Hi Stefan,
>
> Afaik there isn't a more efficient way of doing this. DAGs that rely on a
> lot of sensors experience the same issues. The only way I can think of right
> now is updating the state directly in the database, but then you need to
> know what you are doing. I can imagine that this would be feasible using an
> AWS Lambda function. Hope this helps.
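>
> As a bare illustration of the direct-database route (the dag/task ids are
> invented; verify the columns against your own metadata schema), marking a
> waiting task as done boils down to something like:
>
> UPDATE task_instance
> SET state = 'success'
> WHERE dag_id = 'my_dag'
>   AND task_id = 'wait_for_external'
>   AND execution_date = '2018-05-26 17:50:00';
>
> which an external process (such as the Lambda mentioned above) could run
> once the real work finishes.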
>
> Cheers, Fokko
>
> 2018-05-26 17:50 GMT+02:00 Stefan Seelmann :
>
> > Hello,
> >
> > I have a DAG (externally triggered) where some processing is done at an
> > external system (EC2 instance). The processing is started by an Airflow
> > task (via HTTP request). The DAG should only continue once that
> > processing is completed. In a first naive implementation I created a
> > sensor that gets the progress (via HTTP request) and only if status is
> > "finished" returns true and the DAG run continues. That works but...
> >
> > ... the external processing can take hours or days, and during that time
> > a worker is occupied which does nothing but HTTP GET and sleep. There
> > will be hundreds of DAG runs in parallel which means hundreds of workers
> > are occupied.
> >
> > I looked into other operators that do computation on external systems
> > (ECSOperator, AWSBatchOperator) but they also follow that pattern and
> > just wait/sleep.
> >
> > So I want to ask if there is a more efficient way to build such a
> > workflow with Airflow?
> >
> > Kind Regards,
> > Stefan
> >
>


Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hmm, perhaps we've just had a couple of bad/unlucky runs but, in general,
the underlying task-kill process doesn't really seem to work, from what
we've seen.  I would guess this is related to
https://issues.apache.org/jira/browse/AIRFLOW-1623.



On Sun, Dec 17, 2017 at 12:22 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Shorter heartbeats; you might still have some tasks being scheduled
> nevertheless due to the time window. However, if a task detects it is
> running somewhere else, it should also terminate itself.
>
> [scheduler]
> # Task instances listen for external kill signal (when you clear tasks
> # from the CLI or the UI), this defines the frequency at which they should
> # listen (in seconds).
> job_heartbeat_sec = 5
>
> Bolke.
>
>
> > On 17 Dec 2017, at 20:59, Christopher Bockman <ch...@fathomhealth.co>
> wrote:
> >
> >> P.S. I am assuming that you are talking about your scheduler going down,
> > not workers
> >
> > Correct (and, in some unfortunate scenarios, everything else...)
> >
> >> Normally a task will detect (on the heartbeat interval) whether its
> state
> > was changed externally and will terminate itself.
> >
> > Hmm, that would be an acceptable solution, but this doesn't
> (automatically,
> > in our current configuration) occur.  How can we encourage this behavior
> to
> > happen?
> >
> >
> > On Sun, Dec 17, 2017 at 11:47 AM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >
> >> Quite important to know is that Airflow’s executors do not keep state
> >> after a restart. This particularly affects distributed executors (celery,
> >> dask) as the workers are independent from the scheduler. Thus at restart we
> >> reset all the tasks in the queued state that the executor does not know
> >> about, which means all of them at the moment. Due to the distributed nature
> >> of the executors, tasks can still be running. Normally a task will detect
> >> (on the heartbeat interval) whether its state was changed externally and
> >> will terminate itself.
> >>
> >> I did some work some months ago to make the executor keep state over
> >> restarts, but never got around to finishing it.
> >>
> >> So at the moment, to prevent requeuing, you need to make the airflow
> >> scheduler not go down (as much).
> >>
> >> Bolke.
> >>
> >> P.S. I am assuming that you are talking about your scheduler going down,
> >> not workers
> >>
> >>> On 17 Dec 2017, at 20:07, Christopher Bockman <ch...@fathomhealth.co>
> >> wrote:
> >>>
> >>> Upon further internal discussion, we might be seeing the task cloning
> >>> because the postgres DB is getting into a corrupted state...but
> unclear.
> >>> If consensus is we *shouldn't* be seeing this behavior, even as-is,
> we'll
> >>> push more on that angle.
> >>>
> >>> On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <
> >> ch...@fathomhealth.co
> >>>> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
> >>>> something as simple as the underlying infrastructure going down).
> >>>>
> >>>> Currently, we run everything on Kubernetes (including Airflow), so
> >>>> Airflow pod crashes will generally be detected, and the pods will
> >>>> restart.
> >>>>
> >>>> However, if we have, e.g., a DAG that is running task X when it
> crashes,
> >>>> when Airflow comes back up, it apparently sees task X didn't complete,
> >> so
> >>>> it restarts the task (which, in this case, means it spins up an
> entirely
> >>>> new instance/pod).  Thus, both run "X_1" and "X_2" are fired off
> >>>> simultaneously.
> >>>>
> >>>> Is there any (out of the box) way to better connect up state between
> >> tasks
> >>>> and Airflow to prevent this?
> >>>>
> >>>> (For additional context, we currently execute Kubernetes jobs via a
> >> custom
> >>>> operator that basically layers on top of BashOperator...perhaps the
> new
> >>>> Kubernetes operator will help address this?)
> >>>>
> >>>> Thank you in advance for any thoughts,
> >>>>
> >>>> Chris
> >>>>
> >>
> >>
>
>


Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
> P.S. I am assuming that you are talking about your scheduler going down,
> not workers

Correct (and, in some unfortunate scenarios, everything else...)

> Normally a task will detect (on the heartbeat interval) whether its state
> was changed externally and will terminate itself.

Hmm, that would be an acceptable solution, but this doesn't (automatically,
in our current configuration) occur.  How can we encourage this behavior to
happen?


On Sun, Dec 17, 2017 at 11:47 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Quite important to know is that Airflow’s executors do not keep state
> after a restart. This particularly affects distributed executors (celery,
> dask) as the workers are independent from the scheduler. Thus at restart we
> reset all the tasks in the queued state that the executor does not know
> about, which means all of them at the moment. Due to the distributed nature
> of the executors, tasks can still be running. Normally a task will detect
> (on the heartbeat interval) whether its state was changed externally and
> will terminate itself.
>
> I did some work some months ago to make the executor keep state over
> restarts, but never got around to finishing it.
>
> So at the moment, to prevent requeuing, you need to make the airflow
> scheduler not go down (as much).
>
> Bolke.
>
> P.S. I am assuming that you are talking about your scheduler going down,
> not workers
>
> > On 17 Dec 2017, at 20:07, Christopher Bockman <ch...@fathomhealth.co>
> wrote:
> >
> > Upon further internal discussion, we might be seeing the task cloning
> > because the postgres DB is getting into a corrupted state...but unclear.
> > If consensus is we *shouldn't* be seeing this behavior, even as-is, we'll
> > push more on that angle.
> >
> > On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <
> ch...@fathomhealth.co
> >> wrote:
> >
> >> Hi all,
> >>
> >> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
> >> something as simple as the underlying infrastructure going down).
> >>
> >> Currently, we run everything on Kubernetes (including Airflow), so
> >> Airflow pod crashes will generally be detected, and the pods will
> >> restart.
> >>
> >> However, if we have, e.g., a DAG that is running task X when it crashes,
> >> when Airflow comes back up, it apparently sees task X didn't complete,
> so
> >> it restarts the task (which, in this case, means it spins up an entirely
> >> new instance/pod).  Thus, both run "X_1" and "X_2" are fired off
> >> simultaneously.
> >>
> >> Is there any (out of the box) way to better connect up state between
> tasks
> >> and Airflow to prevent this?
> >>
> >> (For additional context, we currently execute Kubernetes jobs via a
> custom
> >> operator that basically layers on top of BashOperator...perhaps the new
> >> Kubernetes operator will help address this?)
> >>
> >> Thank you in advance for any thoughts,
> >>
> >> Chris
> >>
>
>


Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Upon further internal discussion, we might be seeing the task cloning
because the postgres DB is getting into a corrupted state...but unclear.
If consensus is we *shouldn't* be seeing this behavior, even as-is, we'll
push more on that angle.

On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <ch...@fathomhealth.co
> wrote:

> Hi all,
>
> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
> something as simple as the underlying infrastructure going down).
>
> Currently, we run everything on Kubernetes (including Airflow), so Airflow
> pod crashes will generally be detected, and the pods will restart.
>
> However, if we have, e.g., a DAG that is running task X when it crashes,
> when Airflow comes back up, it apparently sees task X didn't complete, so
> it restarts the task (which, in this case, means it spins up an entirely
> new instance/pod).  Thus, both run "X_1" and "X_2" are fired off
> simultaneously.
>
> Is there any (out of the box) way to better connect up state between tasks
> and Airflow to prevent this?
>
> (For additional context, we currently execute Kubernetes jobs via a custom
> operator that basically layers on top of BashOperator...perhaps the new
> Kubernetes operator will help address this?)
>
> Thank you in advance for any thoughts,
>
> Chris
>


how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hi all,

We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
something as simple as the underlying infrastructure going down).

Currently, we run everything on Kubernetes (including Airflow), so Airflow
pod crashes will generally be detected, and the pods will restart.

However, if we have, e.g., a DAG that is running task X when it crashes,
when Airflow comes back up, it apparently sees task X didn't complete, so
it restarts the task (which, in this case, means it spins up an entirely
new instance/pod).  Thus, both runs "X_1" and "X_2" are fired off
simultaneously.

Is there any (out of the box) way to better connect up state between tasks
and Airflow to prevent this?

(For additional context, we currently execute Kubernetes jobs via a custom
operator that basically layers on top of BashOperator...perhaps the new
Kubernetes operator will help address this?)
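
A rough sketch of that kind of layering (the job spec path and names are
invented for illustration; this is not our actual operator):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="k8s_job_example",
    start_date=datetime(2017, 12, 1),
    schedule_interval=None,
)

run_task_x = BashOperator(
    task_id="run_task_x",
    # Submit the Kubernetes Job, then poll until it reports success.
    bash_command=(
        "kubectl apply -f /dags/jobs/task_x.yaml && "
        "until [ \"$(kubectl get job task-x "
        "-o jsonpath='{.status.succeeded}')\" = \"1\" ]; do sleep 60; done"
    ),
    dag=dag,
)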

Thank you in advance for any thoughts,

Chris


Re: Meetup Interest?

2017-10-13 Thread Christopher Bockman
+1 as a vote.

We're very actively working on Kube+Airflow, so would be particularly
interested on discussions there.

On Fri, Oct 13, 2017 at 12:59 PM, Joy Gao  wrote:

> Hi Dan,
>
> I'd be happy to give an update on progress of the new RBAC UI we've been
> working on here at WePay.
>
> Cheers,
> Joy
>
> On Fri, Oct 13, 2017 at 12:10 PM, Dan Davydov <
> dan.davy...@airbnb.com.invalid> wrote:
>
> > Is there interest in doing an Airflow meet-up? Airbnb can host one in San
> > Francisco.
> >
> > Some talk ideas can include the progress on Kubernetes integration and
> > Scaling & Operations with Airflow. If you want to see other topics
> covered,
> > feel free to suggest them!
> >
>


Re: Airflow + Kubernetes update meeting

2017-09-05 Thread Christopher Bockman
@Daniel great, thanks!  We'd love to participate--have just started
combining Kubernetes+Airflow and are presumably hitting all of the same
issues everyone else is.

On Tue, Sep 5, 2017 at 11:10 AM, Dan Davydov <dan.davy...@airbnb.com> wrote:

> Works for me as well!
>
> On Tue, Sep 5, 2017 at 10:43 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> @Marc we will make sure to record the meeting/supply notes. This should
>> be a pretty straightforward update/overview meeting.
>> @ChrisB this meeting will be a virtual meeting, though Bloomberg is
>> definitely interested in hosting an airflow meetup at our SF location if
>> there is sufficient interest :).
>> @ChrisR Great to hear :). We've been working with members of the
>> openshift community so we can definitely speak to those requirements.
>>
>>
>> On Tue, Sep 5, 2017 at 10:24 AM Feng Lu <fen...@google.com.invalid>
>> wrote:
>>
>>> +1, either way works for me.
>>>
>>> On Tue, Sep 5, 2017 at 10:10 AM, Chris Riccomini <criccom...@apache.org>
>>> wrote:
>>>
>>> > Works for me.
>>> >
>>> > On Tue, Sep 5, 2017 at 7:44 AM, Grant Nicholas <grantnicholas2015@u.
>>> > northwestern.edu> wrote:
>>> >
>>> >> +1 for me if it works with others.
>>> >>
>>> >> On Mon, Sep 4, 2017 at 11:02 PM, Anirudh Ramanathan <
>>> >> ramanath...@google.com> wrote:
>>> >>
>>> >>> Date/time work for me if we get quorum from this group.
>>> >>>
>>> >>> On Thu, Aug 31, 2017 at 7:54 PM, Christopher Bockman <
>>> >>> ch...@fathomhealth.co> wrote:
>>> >>>
>>> >>>> Hi Daniel, would this be remote or in person?
>>> >>>>
>>> >>>>
>>> >>>> On Aug 31, 2017 4:16 PM, "Daniel Imberman" <
>>> daniel.imber...@gmail.com>
>>> >>>> wrote:
>>> >>>>
>>> >>>> Hey guys!
>>> >>>>
>>> >>>> So I wanted to set up a meeting to discuss some of the
>>> updates/current
>>> >>>> work
>>> >>>> that is going on with both the kubernetes operator and kubernetes
>>> >>>> executor
>>> >>>> efforts. There has been some really cool updates/proposals on the
>>> >>>> design of
>>> >>>> these two features and I would love to get some community feedback
>>> to
>>> >>>> make
>>> >>>> sure that we are taking this in a direction that benefits everyone.
>>> >>>>
>>> >>>> I am thinking of having this meeting at 10:00AM on Thursday,
>>> September
>>> >>>> 7th
>>> >>>> PST. Would this time/place work?
>>> >>>>
>>> >>>> Thanks!
>>> >>>>
>>> >>>> Daniel
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Anirudh Ramanathan
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>


Re: Airflow + Kubernetes update meeting

2017-08-31 Thread Christopher Bockman
Hi Daniel, would this be remote or in person?

On Aug 31, 2017 4:16 PM, "Daniel Imberman" 
wrote:

Hey guys!

So I wanted to set up a meeting to discuss some of the updates/current work
that is going on with both the kubernetes operator and kubernetes executor
efforts. There have been some really cool updates/proposals on the design of
these two features and I would love to get some community feedback to make
sure that we are taking this in a direction that benefits everyone.

I am thinking of having this meeting at 10:00AM on Thursday, September 7th
PST. Would this time/place work?

Thanks!

Daniel