Hi Jarek,

The plan sounds great! And +1 to a special interest group. Please add me to
the group if you do create one.

Here is the doc ( Airflow Multi-tenancy discussion
<https://docs.google.com/document/d/17kgfLO2fpNC62YuCxo1l0yRipt48YE6g1d1SGUok2I8/edit#heading=h.d14mqct3autb>
)
that we used for the discussion back in April. It's not meeting notes per se,
but I think it can shed some light on what we talked about. Other folks may
have actual notes or even a draft proposal on this topic.

I'm excited for us to move forward with this.

Bin

On Fri, Nov 5, 2021 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Hello Ian, Everyone,
>
> I wonder if there are any notes from the meeting in April? Has there
> been any more work on that one from Cloudera to formalize and plan
> work on it?
>
> I was not able to participate, but I think it's about time to
> seriously start work on that, and I am super happy to take more of a lead
> on this project and involve all the interested parties. The ideas
> described in the email and discussed afterwards are, I think, super
> reasonable and definitely necessary to get to multi-tenancy, and I
> believe that there are already ideas that can be turned into reality
> rather soon. I also had a talk today with the Google Composer team and
> they are also fully on board with dedicating a lot of effort to this
> one (and their ideas are, I think, super-aligned with Cloudera's), so I
> think we have the critical mass and engineering power to make it happen
> :)
>
> I plan to put quite a lot of focus on that one over the coming months
> and I am happy to lead or co-lead the AIP and take a big part in
> implementation.
>
> Possibly we should create a special interest group around that and
> start drafting the AIP proposals in a smaller group of people who are
> interested and start planning the work. I already have some ideas
> where we could start gradually implementing it (of course after we
> prepare the AIP and get it through the community's approval process).
>
> How does it sound?
>
> J.
>
> On Wed, Apr 21, 2021 at 8:56 AM Ian Buss <ianjb...@gmail.com> wrote:
> >
> > Yes, no invite required. See you tomorrow!
> > On 21 Apr 2021, 07:46 +0100, Sumit Maheshwari <msu...@apache.org>,
> wrote:
> >
> > I'll join as well (I believe the zoom link will work without an invite)
> >
> > On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <xan...@gmail.com>
> wrote:
> >>
> >> hi all,
> >>
> >> great to read about this, I'd like to join in! Can I just join using
> the zoom link tomorrow or do I need an invitation? (If I do need one,
> please invite me :))
> >>
> >> cheers
> >>
> >>
> >> On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
> >>>
> >>> Thank you Ian,
> >>>
> >>> I’ve invited everyone on this thread to the meeting with that zoom
> link. Anyone else who wants to join can add the calendar event here
> calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=dan...@astronomer.io
> >>>
> >>> On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
> >>>
> >>> If this works for everyone, here's a zoom link for Thursday 8AM PST:
> https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
> >>>
> >>> Happy to move or use an alternate method as needed.
> >>>
> >>> On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
> >>>>
> >>>> Thursday works for me!
> >>>>
> >>>> On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com>
> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I actually can’t do Wednesday next week as I’m moving house :) Any
> chance we could do Thursday or Friday at the same time?
> >>>>
> >>>> Cheers
> >>>>
> >>>> Ian
> >>>> On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
> >>>>
> >>>> Just a few comments here:
> >>>>
> >>>> Currently -- and at least for the foreseeable future -- Airflow workers
> will need access to the DAG files, so workers cannot run using the serialized
> DAGs.
> >>>>
> >>>> Also, serialized DAGs do not even have all the info needed to run
> them. Currently the serialization happens in the parsing process in the
> scheduler, which could in future be separated out as a separate "parsing"
> component, but that won't solve the "isolation" problem you are trying to
> solve. The only current way it can be solved is pickling -- and we have
> strictly decided against using pickling for DAGs.
> >>>>
> >>>> The ideas in statements (2) and (3) would help solve the isolation
> problem in (1) and can be done with some work now.
> >>>>
> >>>> Happy to talk about it in more detail here or on call, the time
> Daniel suggested works for me.
> >>>>
> >>>> Regards,
> >>>> Kaxil
> >>>>
> >>>> On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
> >>>>>
> >>>>> How about Wednesday, April 21 at 8:00AM PST?
> >>>>>
> >>>>> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com>
> wrote:
> >>>>>
> >>>>> I am available any day.
> >>>>>
> >>>>> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi everyone!
> >>>>>>
> >>>>>> Would people be available around 8AM/9AM PST at some point next week?
> I’m in PST and Ian is UTC+1, so it would be great to find a time that
> accommodates everyone.
> >>>>>>
> >>>>>> Daniel
> >>>>>> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> I’d also like to be added please :)
> >>>>>>
> >>>>>> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Daniel & Ian,
> >>>>>>
> >>>>>> I am also interested in the idea of a serialized representation
> that can be executed by workers directly. Can you also add me to the call?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Bin
> >>>>>>
> >>>>>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>> Daniel,
> >>>>>>>
> >>>>>>> Thanks for your warm welcome and quick response and the advice on
> providers! Will certainly check out the examples you sent.
> >>>>>>>
> >>>>>>> 1. An "airflow register" command definitely sounds promising,
> would love to collaborate on an AIP there so let's set something up.
> >>>>>>> 2. We use KubernetesExecutor exclusively as well. We've noticed
> significant additional load on the metadata DB as we scale up task pods so
> I've also thought about an API-based approach. Such an API could also open
> up the possibility of per-task security tokens which are injected by the
> scheduler, which should improve the security of such a system. Food for
> thought at least. I will start putting some of these thoughts down on paper
> in a sharable format.
> >>>>>>>
> >>>>>>> Ian
> >>>>>>>
> >>>>>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Ian,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Firstly, welcome to the Airflow community :). I'm glad to hear
> you've had a positive experience so far. It's great to hear that you want
> to contribute back, and I think that multi-tenancy/DAG isolation is a
> pretty fantastic project for the community as a whole (a lot of these are
> things we want but are limited by the hours in a day).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 1. I've personally been kicking around some ideas lately about an
> "airflow register" command that would write the DAG into the metadata DB in
> a way that could be "gettable" by the workers via the API. This work is
> very early. I'd love to get some help on it. Perhaps we can set up a zoom
> chat to discuss drafting an AIP?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2. Limiting worker access to the DB is not only good security
> practice; it also opens up the door to a lot of valuable features. This
> feature would be especially close to my heart as it would make the
> KubernetesExecutor significantly more efficient. It should be possible to
> set up a system where the workers only ever speak to an API server and
> never need to touch the DB.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 3. This is not something I personally have insight into, but I
> think it sounds like a good idea.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Finally, addressing your question about a Cloudera provider: if
> anything, it would probably give the provider _more_ legitimacy if you
> hosted it under the Cloudera GitHub org (we very purposely created the
> provider packages with this workflow in mind). There are multiple places
> where we can work to surface this provider so it is easy to find and use.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Astronomer has a pretty good sample provider here. One example of
> it running in the wild is the Great Expectations provider here. I'd also be
> glad to get you in contact with people who have built providers in the past
> to help you with that process.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Looking forward to seeing some of these things come to fruition!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Daniel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> First a quick introduction: I'm an engineer with Cloudera working
> on our Data Engineering product (CDE). Airflow is working great for us so
> far. We've been looking into how we can enhance the multi-tenancy story of
> Apache Airflow as we currently deploy it. We have the following areas which
> we'd like (with community consensus) to work on and contribute back to
> Apache Airflow to enhance the isolation between tenants in a single Airflow
> deployment.
> >>>>>>>>
> >>>>>>>> 1. Isolating code execution and parsing of DAG files. At the
> moment, DAG files are parsed in a few locations in Airflow, including the
> scheduler and in tasks. There is already the concept of DAG serialization
> (and we're using that for the web component) but we'd be interested to see
> if we can sandbox the execution of arbitrary user code to a locked down
> process/container without full access to the metadata DB and connection
> secrets etc. The idea would be to parse and serialize the DAG in this
> isolated container and pass back a serialized representation for
> persistence in the DB. Has anyone explored this idea?
> >>>>>>>>
> >>>>>>>> 2. Limiting task access to the metadata DB. It would be great if
> we could remove the requirement for tasks to have full access to the
> metadata DB and to report task status in a different (but still scalable)
> way. Naturally, we'd also need to tackle access to, or injection of,
> connection, variable and XCom data for each task.
> >>>>>>>>
> >>>>>>>> 3. Finer-grained access controls on connection secrets. Right
> now, although there are nice at-rest encryption options with Fernet or
> Vault, IIUC any DAG can access any connection (and thus any secret). Since
> the "run as" user is largely defined within the DAG and its tasks, this is
> challenging for a multi-tenant environment (see caveat below).
> >>>>>>>>
> >>>>>>>> Caveat: It's definitely noted that to some extent we should
> assume that an Airflow deployment is a "trusted" environment and that best
> practices such as git+PR workflows are the gold standard and that any
> malicious code and dependencies should be identified through this process.
> Also that there is a clear admin role for connection management etc.
> >>>>>>>>
> >>>>>>>> We have some ideas informally sketched out as to how to address
> the above but would be keen to hear the community opinion on this and to
> see if anyone is keen to collaborate on designs and implementation, or to
> hear if anything is already in the works. In particular I noticed that the
> very first improvement proposal (AIP-1) addresses much of the above :).
> However, it seems fairly dormant at the moment.
> >>>>>>>>
> >>>>>>>> One other question: we have a provider (operators and hooks) for
> interacting with Cloudera components that we'd like to contribute to the
> project. The provider FAQs indicate that new provider contributions are
> still welcome in the project in 2.x, is that accurate?
> >>>>>>>>
> >>>>>>>> Thanks in advance!
> >>>>>>>>
> >>>>>>>> Ian
>
