Thanks Jarek, both the comments you added above make sense and help me understand the steps here.
I was definitely struggling with the dependencies of this AIP on AIP-72 (Task Isolation) and AIP-66 (DAG Bundles). Specifically, there were a couple of references that were key in my mind:

- Workload isolation, which I do think is very clearly understood in AIP-72.
- DAG file processor changes, which I hadn't quite thought through in the context of multi-team, and which may touch both AIP-66 and AIP-72.

I think I understand your concept of a team very clearly, and the concept of "team configuration". I did find two of the "per-team configuration principles" somewhat in conflict, though:

1. *Each team configuration SHOULD be a separate configuration file or separate set of environment variables, holding team-specific configuration needed by the DAG file processor, Workers and Triggerer.*
2. In a multi-team deployment, Connections and Variables have a (nullable) team_id field, which makes them either belong to a specific team or be "globally available".

Personally, I would have thought you would want to standardize on either explicit configuration required or implicit allowed throughout, rather than mixing the two. But you probably have reasons why a mixed approach is suitable here.

I also agree with your comment on scheduling fairness for now. I assume that the per-team DAG parsing environment is where you see the need to integrate with AIP-66 (DAG Bundles ...) - is that correct?
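To check my reading of principle 1, here is roughly what I picture a per-team configuration file looking like, and how a per-team component could load it. Everything below - the file name, sections, keys and values - is purely my own illustration (assuming the TOML format proposed later in this thread), not something the AIP defines:

# Hypothetical per-team configuration file, e.g. teams/team_a.toml.
# All keys here are illustrative guesses, not part of AIP-67.
# Requires Python 3.11+ for the standard-library TOML parser.
import tomllib

TEAM_A_CONFIG = """
[team]
id = "team_a"

[dag_processor]
subdir = "/opt/airflow/dags/team_a"

[workers]
queue = "team_a_queue"
container_image = "registry.example.com/team-a/airflow-worker:latest"

[triggerer]
capacity = 500
"""

# Per principle 1, only team_a's DAG file processor, workers and triggerer
# would read this file (or an equivalent set of environment variables).
config = tomllib.loads(TEAM_A_CONFIG)
print(config["team"]["id"], config["workers"]["queue"])

If I read it right, principle 2 (the nullable team_id on Connections and Variables) then lives in the shared metadata DB rather than in these per-team files - which is where the explicit-vs-implicit mix I mentioned comes from.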
On Thu, Jul 25, 2024 at 5:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> One more thing to add - if there will be no more comments, I will start a vote soon - but maybe some clarifications might help; I spoke to a few other people about it:
>
> * This one is heavily based on other AIPs that are also part of Airflow 3. While some parts are independent, AIP-72 (Task Isolation) and AIP-66 (Parsing and Bundling) - and some details of those - might have an impact on some of the details here, and maybe those dependencies will **not** complete in time for Airflow 3.0. So I am happy to mark it as Airflow 3.0 (or 3.1, depending on dependencies).
> * The AIP is not trying to address "huge" installations with multiple tenants. Those are better served by writing extra UI layers and more-or-less "hiding" Airflow behind the corporate "solution", for those who can afford it. It's mostly for "mid-size" deployments, where there are enough separate (but interdependent) teams to manage, but where the users won't have the capacity to build their own "uber-Airflow" on top, and where the Airflow UI might be useful to manage all the pipelines from those teams together. I might want to clarify that if it helps to convince those unconvinced.
>
> J
>
> On Tue, Jul 23, 2024 at 9:32 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I responded to some comments and had a long discussion on Slack with Ash, and I would love to hear even more comments and see if my responses are satisfactory (so I would love to get confirmation/further comments on those threads opened as discussion points - before I put it to a vote). There are 18 inline comments now that wait for "yeah, looks good" or "no, I still have doubts":
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>
>> Things that I wanted to address as general "big points" and explain here - so that others can make their comments on them here as well.
>>
>> 1) (Mostly Ash) *Is it worth it? Should it be part of the community effort?* a) On one hand, AIP-72 gives a lot of isolation. b) AIP-66 Bundles and Parsing **might** introduce separate execution environments. c) On the other hand, users can wrap multiple Airflow instances if the only "real" benefit is a single "UI/management" endpoint.
>>
>> My answer - I think it still is.
>>
>> a) AIP-72 does give isolation (same as AIP-44, only more optimized and designed from scratch), but it does not allow separation between teams for DAG file processors. AIP-67 effectively introduces it by allowing each team to have separate DAG file processors - this is possible even today - but it also allows each team to effectively use a per-team environment, because via team_id-specific executors all the tasks of all the DAGs of the team might be executing in their dedicated environment (so, for example, the whole team might use the same container image for processors and all the workers they have in Celery). This is currently not part of AIP-72.
>>
>> b) While AIP-66 initially had the concept of a per-bundle environment, I thought that having it defined by the DAG author, and the way it would have to be managed, is problematic. Especially if we allow "any" binary environment (whether it's a container image or a zip containing a venv), it has security issues (unreviewable dependencies) and performance issues for a zipped venv (installing such a bundle takes time, and environmental issues involving shared libraries and architectural differences are basically insurmountable - for example, that would make it very difficult to test the same DAGs across different architectures and even different versions of the same OS). Also, it was not clear who would manage those environments and how (i.e. the roles of the actors involved). While this is a great idea, I think it's too granular to be effectively managed "per team" - though it is great for some exceptional cases where we can say "this DAG is so special that it needs a separate environment". I strongly believe the "per team" environment is a much more common case (and it coincides with "working as a team", where the whole team should be able to easily run and develop various DAGs). Conway's law in full swing. AIP-67 addresses all of that - various environments per team, with the Team Deployment Manager introduced as the actor who manages those environments and the team configuration.
>>
>> c) Yes, it's possible to wrap multiple Airflows with an extra UI layer and have them managed in common - but that layer either has to provide a "clone" of the Airflow UI where you can see several DAGs, possibly interacting with each other, or you have to have some awkward way of switching between, or showing, multiple Airflow UIs at the same time. The first is going to be super-costly to maintain over time as Airflow evolves; the second is really a band-aid IMHO. And when I think about it, I think mostly about on-premise users, not managed-services users. Yes, managed services could afford such extra layers, and it could be thought of as a nice "paid" feature of Airflow services. So maybe indeed we should think about it as "not in community"?
>> But I know a number of users who have on-premise Airflow (and for good reasons cannot move to managed Airflow), and we would cut all of them off from being able to have a number of their teams' pipelines isolated from an environment and security point of view, while still centrally monitored and managed.
>>
>> 2) (Vikram, Constance and TP): *Permissions for Data Assets (related to AIP-73 + sub-AIPs)* - I proposed that each team's Data Assets are separated (same as for DAGs, team_id is assigned automatically when the DAG File Processor parses the DAGs) and that they are effectively "separate namespaces". We have no proposal for permissions of Data Assets (yet) - it might come later - that's why I do not want to interfere with it and proposed a simple solution where Data Assets can be matched by URIs (which become optional in AIP-73), and a DAG using a Data Asset with a given URI might get triggered by a Data Asset with the same URI from another team only if the "triggered" DAG explicitly has `allowed_triggering_by = ["team_2", "team_3"]` (or similar).
>>
>> 3) (Niko, Amogh, TP) - *Scope of "access" to team DAGs/folders/managing access* - I assume that from the DAG file processing point of view, "folder(s)" is currently the right abstraction (and already handled by the standalone DAG file processor). The nice thing about it is that control over "access" to the folder can (and should) be done outside of Airflow - we delegate it out. Similarly to Auth Manager access - we do not want to record which team members have access; we want to delegate that out. We do not want to keep information about who is in which team, nor handle team changes - this should all be handled in the corporate identity system. This way both sides are delegated out - DAG authoring and DAG management. Airflow only knows the "team_id", and nothing else, once the DAG is parsed (the DAG and the Data Assets created by parsing the DAG have team_id assigned automatically). In the future we might expand it - when we have, for example, declarative DAGs submitted via API, the API can have "team_id" added, or when a DAG is created via some kind of factory, that factory might assign the ids. The "folder" that we have now is simply there to make use of the DAG file processor's --subdir flag - but actually it does not matter: once the DAG gets serialized, the DAG and its assets will just have "team_id" assigned, and they do not have to come from the same subdir.
>>
>> Looking forward to closing those discussions and putting it finally up to a vote.
>>
>> J.
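For my own understanding of 2) above, this is how I picture the cross-team triggering rule in plain Python. None of this is an Airflow API - the names, the AssetEvent/ConsumerDag shapes and the allowed_triggering_by attribute (which the proposal itself marks as "or similar") are just my illustration of the rule, and the treatment of "global" producers is my assumption:

from dataclasses import dataclass, field

@dataclass
class AssetEvent:
    uri: str
    producer_team_id: str | None  # None = assumed to mean a "globally available" producer

@dataclass
class ConsumerDag:
    dag_id: str
    team_id: str
    watched_uris: set[str]
    allowed_triggering_by: set[str] = field(default_factory=set)

def event_triggers_dag(event: AssetEvent, dag: ConsumerDag) -> bool:
    """May an asset event with a matching URI trigger this DAG?"""
    if event.uri not in dag.watched_uris:
        return False
    # Within the same team (or from a global producer) triggering just works.
    if event.producer_team_id in (None, dag.team_id):
        return True
    # Across teams it only works when the DAG author explicitly opted in.
    return event.producer_team_id in dag.allowed_triggering_by

# team_1's DAG opts in to events produced by team_2 on the same URI:
dag = ConsumerDag("daily_report", "team_1", {"s3://bucket/data.csv"},
                  allowed_triggering_by={"team_2"})
assert event_triggers_dag(AssetEvent("s3://bucket/data.csv", "team_2"), dag)
assert not event_triggers_dag(AssetEvent("s3://bucket/data.csv", "team_3"), dag)

Is that the intended semantics?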
>> On Fri, Jul 19, 2024 at 10:49 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>>> 1. The roles and responsibilities of what you have called "Organization Deployment Managers" vs. "Team Deployment Managers". This is somewhat different than the model I have seen in practice, so I am trying to reconcile this in my head - whether this is a terminology difference or a role difference, and where exactly the line is drawn.
>>>
>>> This is what is not possible in current practice, because we have no such distinction. Currently there is only one Deployment Manager role, but - at least from my talks with some users, and with the approach I proposed - it's possible that the "Organization Deployment Manager" (a.k.a. the Data Platform team) prepares and manages Airflow as a "whole", i.e. running the Airflow scheduler, the webserver, and the connection to the organisation's identity management system. But then those people struggle when different teams have different expectations for dependencies, OS, queues, GPU access etc., and the Data Platform "Organization Deployment Manager" wants to delegate that to each team's deployment manager. Such a "team DM" should be able, for example, to manage their own K8S cluster where their team's jobs will be running - with the appropriate dependencies/hardware resources etc. An important aspect is that each team can manage that without "bothering" the platform team (but will still be trusted enough to be allowed to install arbitrary packages), and the platform team will mostly worry about Airflow itself.
>>>
>>>> 2. DAG File Processing
>>>> I am quite perplexed by this and wondering about the overlap with the "execution bundles" concept, which Jed had defined as part of AIP-66 and has since deleted from there.
>>>
>>> Actually, that was also the result of the discussion we had in the doc with Ash and Jed - and I see the potential of joining the two. AIP-66 defined a bundle environment that could be specified, but the problem with that was that a) it assumed it is based on "pip install --target" venvs (which has a host of potential problems I explained), and b) it skipped over the management part of those environments - i.e. who prepares and manages them. Is it the DAG author? If so, this opens a host of security issues, because those environments are essentially "binary" and "not reviewable" - so there must be some role responsible for them in our security model (a "bundle deployment manager"?), which sounds suspiciously similar to a "team deployment manager". One of the important differences of AIP-67 here is that it explicitly explains and allows separating all three "vulnerable" components per team - DAG file processor, Triggerer, Worker. AIP-66 does not mention that, implicitly assuming that code from different bundles can run on shared machines (DAG file processor/Scheduler) or even in the same process (Triggerer). AIP-67 adds explicit, strong isolation of code execution between the teams - so that code from different teams is not executed on the same machines or containers, and can be easily separated (by deployment options). Without AIP-67 it is impossible - or very difficult and error-prone - to make sure that the code of one "bundle" cannot leak into the other "bundle". We seem to silently skip over the fact that both the DAG file processor and the Triggerer can execute code on the same machine or in the same process, and that isolation in such a case is next to impossible. And this is a very important aspect of "enterprise security isolation" - in some ways even more important than isolating access to the same DB.
>>>
>>> And I actually am quite open to joining them - and having either "bundle = team" or "bundle belongs to team".
>>>
>>>> I will read this doc again for sure. It has grown and evolved so much that, at least for me, it was quite challenging to grasp. Thanks for working through this.
>>>> Vikram
>>>>
>>>> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>>
>>>>> Nice, thanks for clarifying all this!
>>>>>
>>>>> Now that I read the new proposal, it is adding up for me why certain decisions were made. The decision to separate the "common" part from the "per team" part adds up now - it is the traditional paradigm of separating the "control plane" from "compute".
>>>>>
>>>>> Thanks & Regards,
>>>>> Amogh Desai
>>>>>
>>>>> On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> I got the transcript and chat from the last call (thanks Kaxil!) and it allowed me to answer a few questions that were asked during my presentation about AIP-67. I updated the AIP document, but here is a summary:
>>>>>>
>>>>>> 1) What about Pools (asked by Elad, Jed and Jorrick): I thought about it and I propose that pools could have an (optional) team_id added. This will allow users to keep common pools (no team_id assigned) and have team-specific ones. The DAG file processor specific to each team will fail a DAG if it tries to use a pool that is neither common nor belongs to that team. Also, each team will be able to have their own "default_pool" configured. This gives enough flexibility on "common vs. team-exclusive" use of pools.
>>>>>>
>>>>>> 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh, Ash): yes, that is part of the design. The connections and variables can be accessed per team - AIP-72 will only provide the tasks with connections that belong to the team. Ash mentioned OPA (which might be used for that purpose). It is not defined exactly how it will be implemented in AIP-72 - it's not detailed enough yet - but it can use the very mechanisms AIP-72 provides, by only allowing "global" connections and "my team" connections to be passed via the AIP-72 API to the task and the DAG file processor.
>>>>>>
>>>>>> 3) Whether "team = deployment" (Igor / Vikram?) -> it depends on what you understand by deployment. I'd say "sub-deployment": each deployment in a "multi-team" environment will consist of the "common" part, and each team will have their own part (where configuration and management of such team deployment parts will be delegated to the team deployment manager). For example, such deployment managers will be able to build and publish the environment (for example, container images) used by team A to run Airflow, or change "team"-specific configuration.
>>>>>>
>>>>>> 4) "This seems like quite a lot of work to share a scheduler and a web server. What's the net benefit of this complexity?" (Ash, John, Amogh, Maciej): Yes, I absolutely see it as a valuable option.
>>>>>> It reflects the organizational structure and needs of many of our users, who want to manage part of the environment - monitoring what is going on in all of their teams centrally (and managing things like Airflow upgrades and security centrally) - while delegating control of environments and resources down to their teams. This is the need I have heard from many users who have a "data platform team" that makes Airflow available to several of their teams. I think the proposal I have is a nice middle ground that follows Conway's law - that the architecture of your system should reflect your organizational structure. What I separated out as the "common" parts is precisely what a "data platform team" would like to manage, whereas the "team environment" is something the data platform team should (and wants to) delegate to their teams.
>>>>>>
>>>>>> 5) "I am a little surprised by a shared dataset" (Vikram/Elad): The datasets are defined by their URIs and, as such, they don't have "ownership". As I see it, what really matters is who can trigger a DAG, and the controls I proposed allow the DAG author to specify "in this DAG it's also OK when a different (specified) team triggered the dataset event". But I left a note that this is dependent on AIP-73 "Expanded Data Awareness", and once we get that explained/clarified I am happy to coordinate with Constance and see if we need to do more. Happy to hear more comments on that one.
>>>>>>
>>>>>> I reflected the 2 points and 5) in the AIP. Looking forward to more comments on the proposal - in the AIP or here.
>>>>>>
>>>>>> J.
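To make sure I read points 1) and 2) of the Jul 15 message above correctly, here is the rule I think is being described, in plain Python. The names and shapes are only my illustration (and I treat connections the same way as pools), not an actual Airflow API:

from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    team_id: str | None = None  # None -> a "common" pool, usable by every team

def usable_by_team(pool: Pool, dag_team_id: str) -> bool:
    """A team's DAG may use common pools and its own team's pools, nothing else."""
    return pool.team_id is None or pool.team_id == dag_team_id

pools = {
    "default_pool": Pool("default_pool"),            # common
    "gpu_pool": Pool("gpu_pool", team_id="team_a"),  # team_a only
}

# The per-team DAG file processor would fail a team_b DAG referencing gpu_pool:
assert usable_by_team(pools["default_pool"], "team_b")
assert not usable_by_team(pools["gpu_pool"], "team_b")

If that matches the intent, then I assume the same predicate sits behind the AIP-72 API deciding which connections and variables a task may see ("global" plus "my team").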
>>>>>> On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I would like to resume the discussion on AIP-67. After going through a number of discussions and clarifications about the scope of Airflow 3, I rewrote the proposal for AIP-67 with the assumption that we will do it for Airflow 3 only - and that it will be based on the newly proposed AIP-72 (Task Execution Interface) rather than the Airflow 2-only AIP-44 Internal API.
>>>>>>>
>>>>>>> The updated proposal is here:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>>>>>>
>>>>>>> Feel free to comment there in-line or raise your "big" comments here, but here is the impact of changing the target to Airflow 3:
>>>>>>>
>>>>>>> 1) I proposed to change the configuration of Airflow to use the more structured TOML rather than plain "ini" - TOML is a successor of "ini" and is largely compatible, but it has arrays, tables and nesting, has good support in Python, and is the de-facto standard for configuration now (pyproject.toml and the like). This was far too big a change for Airflow 2, but with Airflow 3 it seems very appropriate.
>>>>>>>
>>>>>>> 2) On popular request I added "team_id" as a database field - this has quite a few far-reaching implications, and its ripple effect on Airflow 2 would be far too big for the "limited" multi-team setup - but since we are going to do full versioning, including DB changes, in Airflow 3, this is an opportunity to do it well. The implementation detail will, however, depend on our choice of supported databases, so there is a small dependency on other decisions here. If we stick with both Postgres and MySQL, we will likely have to restructure the DB to have synthetic UUID identifiers in order to add both versioning and multi-team (because of MySQL index limitations).
>>>>>>>
>>>>>>> 3) The "proper" team identifier also allows expanding the scope of multi-team to also allow "per-team" connections and variables. Again, for the Airflow 2 case we could have limited it to connections and variables coming only from "per-team" secrets - but since we are going to have DB identifiers, and we are going to reimplement the Connections and Variables UI anyhow (to get rid of the FAB models and implement them in reactive technology), it's only a bit more complex to add "per-team" access there.
>>>>>>>
>>>>>>> 4) AIP-72, due to its "task" isolation, allows dropping the idea of the "--team" flag on the components. With AIP-72, routing tasks to particular "team" executors is enough, and there is no need to pass the team information via a "--team" flag that was originally supposed to limit access of the components to only a single team. For Airflow 2 and AIP-44 that was a nice "hack" so that we did not have to carry the "authorization" information together with the task. But since part of AIP-72 is to carry verifiable metadata that will allow us to cryptographically verify task provenance, we can drop this hack and rely on the AIP-72 implementation.
>>>>>>>
>>>>>>> 5) Since DB isolation is a "given" with AIP-72, we do not have to split the delivery of AIP-67 into two phases (with and without DB isolation) - it will be delivered as a single "with DB isolation" stage.
>>>>>>>
>>>>>>> Those are the major differences vs. the proposal from May (and, as you might see, it is quite a different scope - and this is really why I insisted on having the Airflow 2 / Airflow 3 discussion before we conclude the vote on it).
>>>>>>>
>>>>>>> I will go through the proposal on Thursday during our call as planned - but feel free to start discussions and comments before.
>>>>>>>
>>>>>>> J.
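One last thing that would help me picture 2) and 3) of the Jul 9 message: is the sketch below roughly the shape of the change? It is only a toy stand-in (assuming SQLAlchemy 2.x), not airflow.models.Connection and not the schema the AIP will actually ship:

import uuid

from sqlalchemy import Column, String, Uuid, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Connection(Base):
    """Toy model only - illustrates a synthetic UUID key plus a nullable team_id."""

    __tablename__ = "connection"

    # Synthetic UUID primary key instead of a wide natural key
    # (mentioned as needed because of MySQL index limitations).
    id = Column(Uuid, primary_key=True, default=uuid.uuid4)
    conn_id = Column(String(250), nullable=False)
    # NULL -> "globally available"; otherwise owned by a specific team.
    team_id = Column(String(63), nullable=True)

# Creates the table against an in-memory SQLite just to show the columns are valid.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)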