Yes, no invite required. See you tomorrow!
On 21 Apr 2021, 07:46 +0100, Sumit Maheshwari <msu...@apache.org>, wrote:
> I'll join as well (I believe the zoom link will work without an invite)
>
> > On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <xan...@gmail.com>
> > wrote:
> > > hi all,
> > >
> > > great to read about this, I'd like to join in! Can I just join using the
> > > zoom link tomorrow or do I need an invitation? (If I do need one, please
> > > invite me :))
> > >
> > > cheers
> > >
> > >
> > > > On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman
> > > > <daniel.imber...@gmail.com> wrote:
> > > > > Thank you Ian,
> > > > >
> > > > > I’ve invited everyone on this thread to the meeting with that zoom
> > > > > link. Anyone else who wants to join can add the calendar event here
> > > > > calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=dan...@astronomer.io
> > > > >
> > > > > On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
> > > > > > If this works for everyone, here's a zoom link for Thursday 8AM
> > > > > > PST:
> > > > > > https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
> > > > > >
> > > > > > Happy to move or use an alternate method as needed.
> > > > > >
> > > > > > > On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman
> > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > Thursday works for me!
> > > > > > > >
> > > > > > > > On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I actually can’t do Wednesday next week as I’m moving house
> > > > > > > > > :) Any chance we could do Thursday or Friday at the same time?
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > Ian
> > > > > > > > > On 14 Apr 2021, 17:49 +0100, Kaxil Naik
> > > > > > > > > <kaxiln...@gmail.com>, wrote:
> > > > > > > > > > Just a few comments here:
> > > > > > > > > >
> > > > > > > > > > Currently, and at least for the foreseeable future,
> > > > > > > > > > Airflow workers will need access to the DAG files, so
> > > > > > > > > > workers cannot run using the serialized DAGs.
> > > > > > > > > >
> > > > > > > > > > Also, serialized DAGs do not even have all the info needed
> > > > > > > > > > to run them. Currently the serialization happens in the
> > > > > > > > > > parsing process in the scheduler, which could in future be
> > > > > > > > > > split out as a separate "parsing" component, but that
> > > > > > > > > > won't solve the "isolation" problem you are trying to
> > > > > > > > > > solve. The only current way it can be solved is pickling --
> > > > > > > > > > and we have strictly decided against using pickling for
> > > > > > > > > > DAGs.
> > > > > > > > > >
> > > > > > > > > > The idea in Statement (2) & (3) would help solve the
> > > > > > > > > > isolation problem in (1) and can be done with some work now.
> > > > > > > > > >
> > > > > > > > > > Happy to talk about it in more detail here or on a call;
> > > > > > > > > > the time Daniel suggested works for me.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Kaxil
> > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman
> > > > > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > > > > > How about Wednesday, April 21 at 8:00AM PST?
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang
> > > > > > > > > > > > <bin.huan...@gmail.com> wrote:
> > > > > > > > > > > > > I am available any day.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman
> > > > > > > > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > > > > > > > > Hi everyone!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Would people be available around 8AM/9AM PST at
> > > > > > > > > > > > > > > some point next week? I’m in PST and Ian is
> > > > > > > > > > > > > > > UTC+1, so it would be great to find a time that
> > > > > > > > > > > > > > > accommodates everyone.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Daniel
> > > > > > > > > > > > > > > On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter
> > > > > > > > > > > > > > > <ryannhat...@gmail.com> wrote:
> > > > > > > > > > > > > > > > I’d also like to be added please :)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Apr 13, 2021, at 21:27, Xinbin Huang
> > > > > > > > > > > > > > > > > <bin.huan...@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Daniel & Ian,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I am also interested in the idea of a
> > > > > > > > > > > > > > > > > serialization representation that can be
> > > > > > > > > > > > > > > > > executed by workers directly. Can you also
> > > > > > > > > > > > > > > > > add me to the call?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > > Bin
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Apr 13, 2021 at 2:49 PM Ian Buss
> > > > > > > > > > > > > > > > > > <ianjb...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > Daniel,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for your warm welcome and quick
> > > > > > > > > > > > > > > > > > > response and the advice on providers!
> > > > > > > > > > > > > > > > > > > Will certainly check out the examples you
> > > > > > > > > > > > > > > > > > > sent.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > 1. An "airflow register" command
> > > > > > > > > > > > > > > > > > > definitely sounds promising, would love
> > > > > > > > > > > > > > > > > > > to collaborate on an AIP there so let's
> > > > > > > > > > > > > > > > > > > set something up.
> > > > > > > > > > > > > > > > > > > 2. We use KubernetesExecutor exclusively
> > > > > > > > > > > > > > > > > > > as well. We've noticed significant
> > > > > > > > > > > > > > > > > > > additional load on the metadata DB as we
> > > > > > > > > > > > > > > > > > > scale up task pods so I've also thought
> > > > > > > > > > > > > > > > > > > about an API-based approach. Such an API
> > > > > > > > > > > > > > > > > > > could also open up the possibility of
> > > > > > > > > > > > > > > > > > > per-task security tokens which are
> > > > > > > > > > > > > > > > > > > injected by the scheduler, which should
> > > > > > > > > > > > > > > > > > > improve the security of such a system.
> > > > > > > > > > > > > > > > > > > Food for thought at least. I will start
> > > > > > > > > > > > > > > > > > > putting some of these thoughts down on
> > > > > > > > > > > > > > > > > > > paper in a sharable format.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Ian
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Tue, Apr 13, 2021 at 7:46 PM Daniel
> > > > > > > > > > > > > > > > > > > > Imberman <daniel.imber...@gmail.com>
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > Hi Ian,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Firstly, welcome to the Airflow
> > > > > > > > > > > > > > > > > > > > > community :). I'm glad to hear you've
> > > > > > > > > > > > > > > > > > > > > had a positive experience so far.
> > > > > > > > > > > > > > > > > > > > > It's great to hear that you want to
> > > > > > > > > > > > > > > > > > > > > contribute back, and I think that
> > > > > > > > > > > > > > > > > > > > > multi-tenancy/DAG isolation is a
> > > > > > > > > > > > > > > > > > > > > pretty fantastic project for the
> > > > > > > > > > > > > > > > > > > > > community as a whole (a lot of these
> > > > > > > > > > > > > > > > > > > > > are things we want but are limited
> > > > > > > > > > > > > > > > > > > > > by hours in a day).
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 1. I've personally been kicking
> > > > > > > > > > > > > > > > > > > > > around some ideas lately about an
> > > > > > > > > > > > > > > > > > > > > "airflow register" command that would
> > > > > > > > > > > > > > > > > > > > > write the DAG into the metadata DB in
> > > > > > > > > > > > > > > > > > > > > a way that could be "gettable" by the
> > > > > > > > > > > > > > > > > > > > > workers via the API. This work is
> > > > > > > > > > > > > > > > > > > > > very early. I'd love to get some help
> > > > > > > > > > > > > > > > > > > > > on it. Perhaps we can set up a zoom
> > > > > > > > > > > > > > > > > > > > > chat to discuss drafting an AIP?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 2. Limiting worker access to the DB
> > > > > > > > > > > > > > > > > > > > > is not only good security practice;
> > > > > > > > > > > > > > > > > > > > > it also opens up the door to a lot of
> > > > > > > > > > > > > > > > > > > > > valuable features. This feature would
> > > > > > > > > > > > > > > > > > > > > be especially close to my heart as it
> > > > > > > > > > > > > > > > > > > > > would make the KubernetesExecutor
> > > > > > > > > > > > > > > > > > > > > significantly more efficient. It
> > > > > > > > > > > > > > > > > > > > > should be possible to set up a system
> > > > > > > > > > > > > > > > > > > > > where the workers only ever speak to
> > > > > > > > > > > > > > > > > > > > > an API server and never need to touch
> > > > > > > > > > > > > > > > > > > > > the DB.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 3. This is not something I personally
> > > > > > > > > > > > > > > > > > > > > have insight into, but I think it
> > > > > > > > > > > > > > > > > > > > > sounds like a good idea.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Finally, addressing your question
> > > > > > > > > > > > > > > > > > > > > about a Cloudera provider. If
> > > > > > > > > > > > > > > > > > > > > anything, it would probably give the
> > > > > > > > > > > > > > > > > > > > > provider _more_ legitimacy if you
> > > > > > > > > > > > > > > > > > > > > hosted it under the Cloudera GitHub
> > > > > > > > > > > > > > > > > > > > > org (we very purposely created the
> > > > > > > > > > > > > > > > > > > > > provider packages with this workflow
> > > > > > > > > > > > > > > > > > > > > in mind). There are multiple places
> > > > > > > > > > > > > > > > > > > > > where we can work to surface this
> > > > > > > > > > > > > > > > > > > > > provider so it is easy to find and
> > > > > > > > > > > > > > > > > > > > > use.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Astronomer has a pretty good sample
> > > > > > > > > > > > > > > > > > > > > provider here. One example of it
> > > > > > > > > > > > > > > > > > > > > running in the wild is the Great
> > > > > > > > > > > > > > > > > > > > > Expectations provider here. I'd also
> > > > > > > > > > > > > > > > > > > > > be glad to get you in contact with
> > > > > > > > > > > > > > > > > > > > > people who have built providers in
> > > > > > > > > > > > > > > > > > > > > the past to help you with that
> > > > > > > > > > > > > > > > > > > > > process.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Looking forward to seeing some of
> > > > > > > > > > > > > > > > > > > > > these things come to fruition!
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Daniel
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Tue, Apr 13, 2021 at 9:43 AM, Ian
> > > > > > > > > > > > > > > > > > > > > Buss <ianjb...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > First a quick introduction: I'm an
> > > > > > > > > > > > > > > > > > > > > > engineer with Cloudera working on
> > > > > > > > > > > > > > > > > > > > > > our Data Engineering product (CDE).
> > > > > > > > > > > > > > > > > > > > > > Airflow is working great for us so
> > > > > > > > > > > > > > > > > > > > > > far. We've been looking into how we
> > > > > > > > > > > > > > > > > > > > > > can enhance the multi-tenancy story
> > > > > > > > > > > > > > > > > > > > > > of Apache Airflow as we currently
> > > > > > > > > > > > > > > > > > > > > > deploy it. We have the following
> > > > > > > > > > > > > > > > > > > > > > areas which we'd like (with
> > > > > > > > > > > > > > > > > > > > > > community consensus) to work on and
> > > > > > > > > > > > > > > > > > > > > > contribute back to Apache Airflow
> > > > > > > > > > > > > > > > > > > > > > to enhance the isolation between
> > > > > > > > > > > > > > > > > > > > > > tenants in a single Airflow
> > > > > > > > > > > > > > > > > > > > > > deployment.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > 1. Isolating code execution and
> > > > > > > > > > > > > > > > > > > > > > parsing of DAG files. At the
> > > > > > > > > > > > > > > > > > > > > > moment, DAG files are parsed in a
> > > > > > > > > > > > > > > > > > > > > > few locations in Airflow, including
> > > > > > > > > > > > > > > > > > > > > > the scheduler and in tasks. There
> > > > > > > > > > > > > > > > > > > > > > is already the concept of DAG
> > > > > > > > > > > > > > > > > > > > > > serialization (and we're using that
> > > > > > > > > > > > > > > > > > > > > > for the web component) but we'd be
> > > > > > > > > > > > > > > > > > > > > > interested to see if we can sandbox
> > > > > > > > > > > > > > > > > > > > > > the execution of arbitrary user
> > > > > > > > > > > > > > > > > > > > > > code to a locked down
> > > > > > > > > > > > > > > > > > > > > > process/container without full
> > > > > > > > > > > > > > > > > > > > > > access to the metadata DB and
> > > > > > > > > > > > > > > > > > > > > > connection secrets etc. The idea
> > > > > > > > > > > > > > > > > > > > > > would be to parse and serialize the
> > > > > > > > > > > > > > > > > > > > > > DAG in this isolated container and
> > > > > > > > > > > > > > > > > > > > > > pass back a serialized
> > > > > > > > > > > > > > > > > > > > > > representation for persistence in
> > > > > > > > > > > > > > > > > > > > > > the DB. Has anyone explored this
> > > > > > > > > > > > > > > > > > > > > > idea?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > 2. Limiting task access to the
> > > > > > > > > > > > > > > > > > > > > > metadata DB. It would be great if
> > > > > > > > > > > > > > > > > > > > > > we could remove the requirement for
> > > > > > > > > > > > > > > > > > > > > > tasks to have full access to the
> > > > > > > > > > > > > > > > > > > > > > metadata DB and to report task
> > > > > > > > > > > > > > > > > > > > > > status in a different (but still
> > > > > > > > > > > > > > > > > > > > > > scalable) way. We'd need to tackle
> > > > > > > > > > > > > > > > > > > > > > access or injection of connection,
> > > > > > > > > > > > > > > > > > > > > > variable and xcom data as well for
> > > > > > > > > > > > > > > > > > > > > > each task naturally.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > 3. Finer-grained access controls on
> > > > > > > > > > > > > > > > > > > > > > connection secrets. Right now,
> > > > > > > > > > > > > > > > > > > > > > although there are nice at-rest
> > > > > > > > > > > > > > > > > > > > > > encryption options with Fernet or
> > > > > > > > > > > > > > > > > > > > > > Vault, IIUC any DAG can access any
> > > > > > > > > > > > > > > > > > > > > > connection (and thus any secret).
> > > > > > > > > > > > > > > > > > > > > > Since the "run as" user is largely
> > > > > > > > > > > > > > > > > > > > > > defined within the DAG and its
> > > > > > > > > > > > > > > > > > > > > > tasks, this is challenging for a
> > > > > > > > > > > > > > > > > > > > > > multi-tenant environment (see
> > > > > > > > > > > > > > > > > > > > > > caveat below)
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Caveat: It's definitely noted that
> > > > > > > > > > > > > > > > > > > > > > to some extent we should assume
> > > > > > > > > > > > > > > > > > > > > > that an Airflow deployment is a
> > > > > > > > > > > > > > > > > > > > > > "trusted" environment and that best
> > > > > > > > > > > > > > > > > > > > > > practices such as git+PR workflows
> > > > > > > > > > > > > > > > > > > > > > are the gold standard and that any
> > > > > > > > > > > > > > > > > > > > > > malicious code and dependencies
> > > > > > > > > > > > > > > > > > > > > > should be identified through this
> > > > > > > > > > > > > > > > > > > > > > process. Also that there is a clear
> > > > > > > > > > > > > > > > > > > > > > admin role for connection
> > > > > > > > > > > > > > > > > > > > > > management etc.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > We have some ideas informally
> > > > > > > > > > > > > > > > > > > > > > sketched out as to how to address
> > > > > > > > > > > > > > > > > > > > > > the above but would be keen to hear
> > > > > > > > > > > > > > > > > > > > > > the community opinion on this and
> > > > > > > > > > > > > > > > > > > > > > to see if anyone is keen to
> > > > > > > > > > > > > > > > > > > > > > collaborate on designs and
> > > > > > > > > > > > > > > > > > > > > > implementation, or to hear if
> > > > > > > > > > > > > > > > > > > > > > anything is already in the works.
> > > > > > > > > > > > > > > > > > > > > > In particular I noticed that the
> > > > > > > > > > > > > > > > > > > > > > very first improvement proposal
> > > > > > > > > > > > > > > > > > > > > > (AIP-1) addresses much of the above
> > > > > > > > > > > > > > > > > > > > > > :). However, it seems fairly
> > > > > > > > > > > > > > > > > > > > > > dormant at the moment.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > One other question: we have a
> > > > > > > > > > > > > > > > > > > > > > provider (operators and hooks) for
> > > > > > > > > > > > > > > > > > > > > > interacting with Cloudera
> > > > > > > > > > > > > > > > > > > > > > components that we'd like to
> > > > > > > > > > > > > > > > > > > > > > contribute to the project. The
> > > > > > > > > > > > > > > > > > > > > > provider FAQs indicate that new
> > > > > > > > > > > > > > > > > > > > > > provider contributions are still
> > > > > > > > > > > > > > > > > > > > > > welcome in the project in 2.x. Is
> > > > > > > > > > > > > > > > > > > > > > that accurate?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks in advance!
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Ian