> 1. The roles and responsibilities of what you have called "Organization
> Deployment Managers" vs. "Team Deployment Managers". This is somewhat
> different from the model I have seen in practice, so I am trying to
> reconcile it in my head - whether this is a terminology difference or a
> role difference, and where exactly the line is drawn.
>
>
This is exactly what is not possible in current practice, because we
have no such distinction - currently there is only one deployment
manager role. But based on my talks with some users, and with the
approach I proposed, it becomes possible that the "organization
deployment manager" (a.k.a. the Data Platform Team) prepares and
manages Airflow as a "whole" - i.e. running the Airflow scheduler and
webserver, and the connection to the organization's identity management
system. Those people, however, struggle when different teams have
different expectations about dependencies, OS, queues, GPU access etc.,
and the Data Platform "Organization Deployment Manager" wants to
delegate that to each team's deployment manager. Such a "team DM"
should be able, for example, to manage their own K8S cluster where
their team's jobs will be running, with the appropriate
dependencies/hardware resources etc. An important aspect is that each
team can manage this without "bothering" the Platform team (while still
being trusted enough to be allowed to install arbitrary packages), and
the platform team will mostly worry about Airflow itself.


> 2. DAG File Processing
> I am quite perplexed by this and am wondering about the overlap with
> the "execution bundles" concept which Jed had defined as part of
> AIP-66 and which has since been deleted from there.
>

Actually, that was also the result of a discussion we had in the doc
with Ash and Jed. And I see the potential of joining the two. AIP-66
defined a bundle environment that could be specified - but the problem
with that was that a) it assumed environments based on "pip install
--target" venvs (which has a host of potential problems I explained),
and b) it skipped over the management part of those - i.e. who prepares
those environments and manages them. Is it the DAG author? If so, this
opens a host of security issues - because those environments are
essentially "binary" and "not reviewable" - so there must be some role
responsible for them in our security model (a "bundle deployment
manager"?) - which sounds suspiciously similar to a "team deployment
manager". One of the important differences of AIP-67 here is that it
explicitly explains and allows separating all three "vulnerable"
components per team -> DAG File Processor, Triggerer, Worker. AIP-66
does not mention that, implicitly assuming that running the code from
different bundles can be done on shared machines (DAG File
Processor/Scheduler) or even in the same process (Triggerer). AIP-67
adds explicit, strong isolation of code execution between the teams, so
that the code from different teams is not executed on the same machines
or containers, but can be easily separated (by deployment options).
Without AIP-67 it is impossible - or very difficult and error-prone -
to make sure that the code of one "bundle" cannot leak into another
"bundle". We seem to silently skip over the fact that both the DAG File
Processor and the Triggerer can execute code on the same machine or
even in the same process, and that isolation in such a case is next to
impossible. And this is a very important aspect of "enterprise security
isolation" - in some ways even more important than isolating access to
the same DB.
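
To make that concrete - a minimal sketch of the kind of separation
AIP-67 enables: one standalone DAG file processor per team (using the
existing "airflow dag-processor" command), each parsing only its own
team's bundle. The bundle paths are illustrative, and in a real
deployment these would be separate machines or containers rather than
local subprocesses:

    import subprocess

    TEAM_BUNDLES = {
        "team_a": "/opt/airflow/bundles/team_a",
        "team_b": "/opt/airflow/bundles/team_b",
    }

    # Each team gets its own parsing process, so no team's DAG code is
    # ever executed in another team's DAG file processor.
    processors = [
        subprocess.Popen(["airflow", "dag-processor", "--subdir", path])
        for path in TEAM_BUNDLES.values()
    ]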

And I actually am quite open to joining them - and having either
"bundle = team" or "bundle belongs to team".



> I will read this doc again for sure.
> It has grown and evolved so much that at least for me it was quite
> challenging to grasp. Thanks for working through this.
>
> Vikram
>
>
> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com>
> wrote:
>
> > Nice, thanks for clarifying all this!
> >
> > Now that I read the new proposal, it is adding up to me why certain
> > decisions were made.
> > The decision to separate the "common" part from the "per team" part
> > adds up now. It is a traditional paradigm of separating "control
> > plane" from "compute".
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > I got the transcript and chat from the last call (thanks Kaxil!)
> > > and it allowed me to answer a few questions that were asked during
> > > my presentation about AIP-67. I updated the AIP document but here
> > > is a summary:
> > >
> > > 1) What about Pools (asked by Elad and Jed, Jorrick): I thought
> > > about it and I propose that pools could have an (optional) team_id
> > > added. This will allow users to keep common pools (no team_id
> > > assigned) and have team-specific ones. The DAG file processor
> > > specific to each team will fail a DAG if it tries to use a pool
> > > that is not common and belongs to another team. Also, each team
> > > will be able to have their own "default_pool" configured. This
> > > will give enough flexibility on "common vs. team-exclusive" use of
> > > pools.
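
To illustrate the rule in 1) - a minimal sketch only; the Pool shape
with an optional team_id and the check itself are hypothetical, not the
actual implementation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Pool:
        name: str
        team_id: Optional[str] = None  # None means a "common" pool

    def check_pool_access(pool: Pool, dag_team_id: str) -> None:
        # Common pools are usable by every team; team pools only by
        # their owning team.
        if pool.team_id is not None and pool.team_id != dag_team_id:
            raise ValueError(
                f"Pool {pool.name!r} belongs to team {pool.team_id!r}, "
                f"not to {dag_team_id!r}"
            )

    check_pool_access(Pool("common_pool"), "team_b")  # OK
    # check_pool_access(Pool("gpu", "team_a"), "team_b")  # ValueError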
> > >
> > > 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh,
> > > Ash): yes. That is part of the design. The connections and
> > > variables can be accessed per team - AIP-72 will only provide the
> > > tasks with connections that belong to the team. Ash mentioned OPA
> > > (which might be used for that purpose). It's not defined how
> > > exactly it will be implemented in AIP-72 - it's not detailed
> > > enough yet - but it can use the very mechanisms that AIP-72
> > > provides, by only allowing "global" connections and "my team"
> > > connections to be passed by the AIP-72 API to the task and DAG
> > > file processor.
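
Similarly, a sketch of the visibility rule from 2) - the helper below
is hypothetical (AIP-72 does not define this exact function):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Connection:
        conn_id: str
        team_id: Optional[str] = None  # None means a "global" one

    def visible_connections(
        team_id: str, connections: List[Connection]
    ) -> List[Connection]:
        # A task only ever receives global connections plus the ones
        # belonging to its own team.
        return [c for c in connections if c.team_id in (None, team_id)]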
> > >
> > > 3) Whether "team = deployment" - Igor / Vikram? -> It depends on
> > > what you understand by deployment. I'd say "sub-deployment" - each
> > > deployment in a "multi-team" environment will consist of the
> > > "common" part, and each team will have their own part (where
> > > configuration and management of such team deployment parts will be
> > > delegated to the team deployment manager). For example, such
> > > deployment managers will be able to build and publish the
> > > environment (for example container images) used by team A to run
> > > Airflow, or change team-specific configuration.
> > >
> > > 4) "This seems like quite a lot of work to share a scheduler and a web
> > > server. What’s the net benefit of this complexity?" -> Ash, John,
> Amogh,
> > > Maciej: Yes. I absolutely see it as a valuable option. It reflects
> > > organizational structure and needs of many of our users, where they
> want
> > to
> > > manage part of the environment, monitoring of what's going in all of
> > their
> > > teams centrally (and manage things like upgrades of Airflow, security
> > > centrally), while they want to delegate control of environments and
> > > resources down to their teams. This is the need that I've heard from
> many
> > > users who have a "data platform team" that makes Airflow available to
> > their
> > > several teams. I think the proposal I have is a nice middle ground that
> > > follows Conway's law - that architecture of your system should reflect
> > your
> > > organizational structure - and what I separated out as "common" parts
> is
> > > precisely what "data platform team" would like to manage, where "team
> > > environment" is something that data platform should (and want to)
> > delegate
> > > to their teams.
> > >
> > > 5) "I am a little surprised by a shared dataset" - Vikram/Elad : The
> > > datasets are defined by their URLs and as such - they don't have
> > > "ownership". As I see it - It's really important who can trigger a DAG
> > and
> > > the controls I proposed allow the DAG author to specify "In this DAG
> it's
> > > also ok when a different team (specified) triggered the dataset event".
> > But
> > > I left a note that it is AIP-73-dependent "Expanded Data Awareness" and
> > > once we get that explained/clarified I am happy to coordinate with
> > > Constance and see if we need to do more. Happy to hear more comments on
> > > that one.
> > >
> > > I reflected the 2 points and 5) in the AIP. Looking forward to more
> > > comments on the proposal - in the AIP or here.
> > >
> > > J.
> > >
> > >
> > >
> > >
> > > On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I would like to resume discussion on AIP-67. After going through a
> > > > number of discussions and clarifications about the scope of
> > > > Airflow 3,
> > > > I rewrote the proposal for AIP-67 with the assumption that we will do
> > > > it for Airflow 3 only - and that it will be based on the new proposed
> > > > AIP-72 (Task Execution Interface) rather than Airflow 2-only AIP-44
> > > > Internal API.
> > > >
> > > > The updated proposal is here:
> > > >
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
> > > >
> > > > Feel free to comment there in-line or raise your "big" comments
> > > > here, but here is the impact of changing the target to Airflow 3:
> > > >
> > > > 1) I proposed to change the configuration of Airflow to use the
> > > > more structured TOML rather than plain "ini" - TOML is a
> > > > successor of "ini" and is largely compatible, but it has arrays,
> > > > tables and nesting, has good support in Python, and is the
> > > > "de-facto" standard for configuration now (pyproject.toml and
> > > > the like). This was far too big of a change for Airflow 2, but
> > > > with Airflow 3 it seems very appropriate.
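
To show what such a structured configuration could look like - a sketch
only, where the section names ([core], [teams.*]) are illustrative and
not taken from the AIP:

    import tomllib  # stdlib since Python 3.11

    config = tomllib.loads('''
    [core]
    executor = "KubernetesExecutor"

    [teams.team_a]
    dags_folder = "/opt/airflow/dags/team_a"
    default_pool = "team_a_pool"

    [teams.team_b]
    dags_folder = "/opt/airflow/dags/team_b"
    default_pool = "team_b_pool"
    ''')

    print(config["teams"]["team_a"]["default_pool"])  # team_a_pool

Nested tables like [teams.team_a] are exactly the kind of structure
that a flat "ini" file cannot express.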
> > > >
> > > > 2) On popular request I added "team_id" as a database field -
> > > > this has quite a few far-reaching implications, and its ripple
> > > > effect on Airflow 2 would be far too big for the "limited"
> > > > multi-team setup - but since we are going to do full versioning,
> > > > including DB changes, in Airflow 3, this is an opportunity to do
> > > > it well. The implementation detail of it will however depend on
> > > > our choice of supported databases, so there is a little
> > > > dependency on other decisions here. If we stick with both
> > > > Postgres and MySQL we will likely have to restructure the DB to
> > > > have synthetic UUID identifiers in order to add both versioning
> > > > and multi-team (because of MySQL index limitations).
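
At the model level, 2) could look roughly like this (table and column
names are assumptions, not the AIP-67 schema):

    import uuid

    from sqlalchemy import Column, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class TeamScopedRow(Base):
        __tablename__ = "team_scoped_row"
        # Synthetic UUID primary key instead of a wide natural key
        # (stays within MySQL's index length limitations).
        id = Column(String(36), primary_key=True,
                    default=lambda: str(uuid.uuid4()))
        # NULL team_id means the row is "common", i.e. not team-scoped.
        team_id = Column(String(63), nullable=True, index=True)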
> > > >
> > > > 3) The "proper" team identifier also allows to expand the scope of
> > > > multi-team to also allow "per-team" connections and variables. Again
> > > > for Airflow 2 case we could limit it to only the case where
> > > > connections and variables comes only from "per-team" secrets - but
> > > > since we are going to have DB identifiers and we are going to -
> anyhow
> > > > - reimplement Connections and Variables UI to get rid of FAB models
> > > > and implement them in reactive technology, it's only a bit more
> > > > complex to add "per-team" access there.
> > > >
> > > > 4) AIP-72, due to its "task" isolation, allows dropping the idea
> > > > of the "--team" flag on the components. With AIP-72, routing
> > > > tasks to particular "team" executors is enough, and there is no
> > > > need to pass the team information via the "--team" flag that was
> > > > originally supposed to limit access of the components to only a
> > > > single team. For Airflow 2 and AIP-44 that was a nice "hack" so
> > > > that we would not have to carry the "authorization" information
> > > > together with the task. But since part of AIP-72 is to carry
> > > > verifiable metadata that will allow us to cryptographically
> > > > verify task provenance, we can drop this hack and rely on the
> > > > AIP-72 implementation.
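
AIP-72 does not prescribe the exact mechanism, but a generic
illustration of cryptographically verifiable task metadata (here a
simple HMAC signature with an assumed per-deployment key) could look
like:

    import hashlib
    import hmac
    import json

    SIGNING_KEY = b"per-deployment secret"  # illustrative only

    def sign_task_metadata(meta: dict) -> str:
        payload = json.dumps(meta, sort_keys=True).encode()
        return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

    def verify_task_metadata(meta: dict, signature: str) -> bool:
        return hmac.compare_digest(sign_task_metadata(meta), signature)

    meta = {"dag_id": "d1", "task_id": "t1", "team_id": "team_a"}
    sig = sign_task_metadata(meta)
    assert verify_task_metadata(meta, sig)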
> > > >
> > > > 5) Since DB isolation is a "given" with AIP-72, we do not have
> > > > to split the delivery of AIP-67 into two phases (with and
> > > > without DB isolation) - it will be delivered as a single "with
> > > > DB isolation" stage.
> > > >
> > > > Those are the major differences vs. the proposal from May (and
> > > > as you might see it is quite a different scope) - and this is
> > > > really why I insisted on having the Airflow 2 / Airflow 3
> > > > discussion before we conclude the vote on it.
> > > >
> > > > I will go through the proposal on Thursday during our call as planned
> > > > - but feel free to start discussions and comments before.
> > > >
> > > > J.
> > > >
> > >
> >
>
