I responded to some comments and had a long discussion on Slack with Ash,
and I would love to hear more comments and see whether my responses are
satisfactory (i.e. get confirmation or further comments on the threads
opened as discussion points - before I put it to a vote). There are 18
inline comments now waiting for a "yeah, looks good" or a "no, I still
have doubts":

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components

Below are the general "big points" I wanted to address and explain here -
so that others can comment on them here as well.

1) (Mostly Ash) *Is it worth it? Should it be part of the community
effort?* a) On one hand, AIP-72 gives a lot of isolation; b) AIP-66
Bundles and Parsing **might** introduce separate execution environments;
c) On the other hand, users can wrap multiple Airflow instances if the
only "real" benefit is a single "UI/management" endpoint.

My answer - I think it still is.

a) AIP-72 does give isolation (the same as AIP-44, only more optimized
and designed from scratch), but it does not allow separation between
teams for DAG file processors. AIP-67 effectively introduces it by
allowing each team to have separate DAG file processors - this is
possible even today - but it then also allows each team to effectively
use a per-team environment, because via team_id-specific executors all
the tasks of all the DAGs of the team might be executing in their
dedicated environment (so, for example, the whole team might use the same
container image for the processors and all the workers they have in
Celery). This is currently not part of AIP-72.
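
To make the "team_id-specific executors" idea a bit more concrete, here
is a rough sketch of the routing logic - all the names and the registry
shape are made up for illustration, this is not the actual AIP-67/AIP-72
implementation:

    # Hypothetical sketch: route each task to its team's executor, so the
    # whole team runs in its dedicated environment (e.g. one container
    # image for the team's DAG file processors and Celery workers).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TaskInstanceStub:
        dag_id: str
        task_id: str
        team_id: Optional[str]  # stamped on the DAG at parse time

    # One executor (and hence one execution environment) per team, plus a
    # default executor for DAGs without a team.
    EXECUTORS = {
        None: "default_celery_executor",
        "team_a": "team_a_celery_executor",  # team A's image and queues
        "team_b": "team_b_k8s_executor",     # team B's own K8S cluster
    }

    def pick_executor(ti: TaskInstanceStub) -> str:
        # The scheduler only needs the team_id to route the task - it
        # knows nothing about team membership (that stays delegated to
        # the corporate identity system).
        return EXECUTORS.get(ti.team_id, EXECUTORS[None])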

b) While AIP-66 initially had the concept of a per-bundle environment, I
thought that having it defined by the DAG author, and the way it would be
managed, was problematic. Especially if we allow "any" binary environment
(whether a container image or a zip containing a venv), it has security
issues (unreviewable dependencies) and performance issues for zipped
venvs (installing such a bundle takes time, and environmental issues
involving shared libraries and architectural differences are basically
insurmountable - for example, that would make it very difficult to test
the same DAGs across different architectures and even different versions
of the same OS). Also, it was not clear who would manage those
environments and how (i.e. the roles of the actors involved). While this
is a great idea, I think it's too granular to be effectively managed "per
team" - though it is great for some exceptional cases where we can say
"this DAG is so special that it needs a separate environment". I strongly
believe the "per team" environment is a much more common case (and it
coincides with "working as a team" - where the whole team should be able
to easily run and develop various DAGs). Conway's law in full swing.
AIP-67 addresses all that - various environments per team, with the Team
Deployment Manager introduced as the actor who manages those environments
and the team configuration.

c) Yes, it's possible to wrap multiple Airflows with an extra UI layer
and have them managed in common - but it either has to provide a "clone"
of the Airflow UI where you can see several DAGs - possibly interacting
with each other - or you have to have some awkward way of quickly
switching between, or showing, multiple Airflow UIs at the same time. The
first is going to be super-costly to maintain over time as Airflow
evolves; the second is really a band-aid, IMHO. And when I think about
it, I think mostly about on-premise users, not managed-service users.
Yes, managed services could afford such extra layers, and it could be
thought of as a nice "paid" feature of Airflow services. So maybe indeed
we should think about it as "not in community"? But I know a number of
users who have on-premise Airflow (and for good reasons cannot move to
managed Airflow), and we would cut all of them off from having their
teams' pipelines isolated from an environment and security point of view
while being centrally monitored and managed.

2) (Vikram, Constance and TP): *Permissions for Data Assets (related to
AIP-73 + sub-AIPs)* - I proposed that each team's Data Assets are
separated (same as for DAGs, team_id is automatically assigned when the
DAGs are parsed by the team's DAG File Processor) - and that they are
effectively "separate namespaces". We have no proposal for permissions of
Data Assets (yet) - it might come later - that's why I do not want to
interfere with it, and I proposed a simple solution where Data Assets can
be matched by URIs (which become optional in AIP-73), and a DAG using a
Data Asset with a given URI may be triggered by a Data Asset from another
team with the same URI only if the "triggered" DAG explicitly has
`allowed_triggering_by = ["team_2", "team_3"]` (or similar).

3) (Niko, Amogh, TP) - *Scope of "access" to team DAGs/folders/managing
access* - I assume that from the DAG file processing POV, "folder(s)" is
currently the right abstraction (and already handled by the standalone
DAG file processor). The nice thing about it is that control over
"access" to the folder can (and should) be done outside of Airflow - we
delegate it out. Similarly to Auth Manager access - we do not want to
record which team members have access - we want to delegate it out. We do
not want to keep information about who is in which team, nor handle team
changes - this should all be handled in the corporate identity system.
This way both sides are delegated out - DAG authoring and DAG management.
Airflow only knows the "team_id", nothing else, once the DAG is parsed
(the DAG and the Data Assets created by parsing the DAG have team_id
assigned automatically). And in the future we might expand it - when we
have, for example, declarative DAGs submitted via the API, the API can
have "team_id" added, or when a DAG is created via some kind of factory,
the factory might assign the id. The "folder" that we have now is simply
to make use of the DAG file processor's --subdir flag - but it actually
does not matter: once the DAG gets serialized, the DAG and its assets
will just have "team_id" assigned, and they do not have to come from the
same subdir.
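
And to illustrate the "stamped at parse time" point above - a minimal
sketch of the idea, with entirely hypothetical names (this is not the
AIP-67 implementation):

    # Hypothetical sketch: a per-team DAG file processor (e.g. started
    # with --subdir pointing at the team's folder) knows only its own
    # team_id and stamps it on everything it serializes - DAGs and their
    # assets - so the DAG author never sets team_id themselves.
    from dataclasses import dataclass, field

    @dataclass
    class SerializedDagRecord:
        dag_id: str
        team_id: str  # assigned by the processor, never by the author
        asset_uris: list[str] = field(default_factory=list)

    def serialize_for_team(
        parsed: dict[str, list[str]], team_id: str
    ) -> list[SerializedDagRecord]:
        # `parsed` maps dag_id -> asset URIs found while parsing the
        # team's folder; every record gets the processor's team_id.
        return [
            SerializedDagRecord(dag_id, team_id, uris)
            for dag_id, uris in parsed.items()
        ]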

Looking forward to closing those discussions and finally putting it up to
a vote.

J.





On Fri, Jul 19, 2024 at 10:49 PM Jarek Potiuk <ja...@potiuk.com> wrote:

>
>> 1. The roles and responsibilities of what you have called "Organization
>> Deployment Managers" vs. "Team Deployment Managers". This is somewhat
>> different from the model I have seen in practice, so I am trying to
>> reconcile this in my head - whether this is a terminology difference or
>> a role difference, or where exactly the line is drawn.
>>
>>
> This is what is not possible in current practice, because we have no such
> distinction. Currently there is only one deployment manager role, but at
> least from my talks with some users, and with the approach I proposed,
> it's possible that the "organization deployment manager" (a.k.a. the Data
> Platform Team) prepares and manages Airflow as a "whole" - i.e. running
> the Airflow scheduler and webserver, and the connection to the
> organization's identity management system. But then - those people
> struggle when different teams have different expectations for
> dependencies, OS, queues, GPU access, etc., and the Data Platform
> "Organization Deployment Manager" wants to delegate that to each team's
> deployment manager - such a "team DM" should be able, for example, to
> manage their own K8S cluster where their team's jobs will be running -
> with appropriate dependencies/hardware resources etc. An important aspect
> is that each team can manage it without "bothering" the Platform team
> (but will still be trusted enough to be allowed to install arbitrary
> packages) - and the platform team will mostly worry about Airflow itself.
>
>
>> 2. DAG File Processing
>> I am quite perplexed by this and wondering about the overlap with the
>> "execution bundles" concept which Jed had defined as part of AIP-66 and
>> which has since been deleted from there.
>>
>
> Actually that was also the result of a discussion we had in the doc with
> Ash and Jed. And I see the potential of joining them. AIP-66 defined a
> bundle environment that could be specified - but the problems with that
> were that a) it assumed environments based on "pip install --target"
> venvs (which have a host of potential problems I explained), and b) it
> skipped over the management part of those environments - i.e. who
> prepares and manages them. Is this the DAG author? If so, this opens a
> host of security issues - because those environments are essentially
> "binary" and "not reviewable" - so there must be some role responsible
> for them in our security model (i.e. a "bundle deployment manager"?) -
> which sounds suspiciously similar to "team deployment manager". One of
> the important differences of AIP-67 here is that it explicitly explains
> and allows separating all three "vulnerable" components per "team" ->
> DAG File Processor, Triggerer, Worker. AIP-66 does not mention that,
> implicitly assuming that running the code from different bundles can be
> done on shared machines (DAG file processor/Scheduler) or even in the
> same process (Triggerer). AIP-67 adds explicit, strong isolation of code
> execution between the teams - so that the code from different teams is
> not executed on the same machines or containers, but can be easily
> separated (by deployment options). Without AIP-67 it is impossible - or
> very difficult and error-prone - to make sure that the code of one
> "bundle" cannot leak into the other "bundle". We seem to silently skip
> over the fact that both the DAG File Processor and the Triggerer can
> execute code on the same machine or in the same process, and that
> isolation in such a case is next to impossible. And this is a very
> important aspect of "enterprise security isolation". In some ways even
> more important than isolating access to the same DB.
>
> And I actually am quite open to joining them - having either "bundle =
> team" or "bundle belongs to team".
>
>
>
>> I will read this doc again for sure.
>> It has grown and evolved so much that at least for me it was quite
>> challenging to grasp. Thanks for working through this.
>>
>> Vikram
>>
>>
>> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com>
>> wrote:
>>
>> > Nice, thanks for clarifying all this!
>> >
>> > Now that I have read the new proposal, I understand why certain
>> > decisions were made. The decision to separate the "common" part from
>> > the "per team" part adds up now. It is the traditional paradigm of
>> > separating the "control plane" from "compute".
>> >
>> > Thanks & Regards,
>> > Amogh Desai
>> >
>> >
>> > On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > > I got the transcript and chat from the last call (thanks Kaxil!)
>> > > and it allowed me to answer a few questions that were asked during
>> > > my presentation about AIP-67. I updated the AIP document but here is
>> > > a summary:
>> > >
>> > > 1) What about Pools (asked by Elad, Jed and Jorrick): I thought
>> > > about it and I propose that pools could have an (optional) team_id
>> > > added. This will allow users to keep common pools (no team_id
>> > > assigned) and have team-specific ones. The DAG file processor
>> > > specific to each team will fail a DAG if it tries to use a pool that
>> > > is not common and belongs to an "other team". Also, each team will
>> > > be able to have their own "default_pool" configured. This will give
>> > > enough flexibility on "common vs. team-exclusive" use of pools.
>> > >
>> > > 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh,
>> > > Ash): yes. That is part of the design. The connections and variables
>> > > can be accessed per team - AIP-72 will only provide the tasks with
>> > > connections that belong to the team. Ash mentioned OPA (which might
>> > > be used for that purpose). It's not defined how exactly it will be
>> > > implemented in AIP-72 - it's not detailed enough - but it can use
>> > > the very mechanisms that AIP-72 provides, by only allowing "global"
>> > > connections and "my team" connections to be passed by the AIP-72 API
>> > > to the task and the DAG file processor.
>> > >
>> > > 3) Whether "team = deployment" (Igor / Vikram?) -> depends on what
>> > > you understand by deployment. I'd say "sub-deployment" - each
>> > > deployment in a "multi-team" environment will consist of the
>> > > "common" part, and each team will have their own part (where
>> > > configuration and management of such team deployment parts will be
>> > > delegated to the team deployment manager). For example, such
>> > > deployment managers will be able to build and publish the
>> > > environment (for example, container images) used by team A to run
>> > > Airflow. Or change "team"-specific configuration.
>> > >
>> > > 4) "This seems like quite a lot of work to share a scheduler and a web
>> > > server. What’s the net benefit of this complexity?" -> Ash, John,
>> Amogh,
>> > > Maciej: Yes. I absolutely see it as a valuable option. It reflects
>> > > organizational structure and needs of many of our users, where they
>> want
>> > to
>> > > manage part of the environment, monitoring of what's going in all of
>> > their
>> > > teams centrally (and manage things like upgrades of Airflow, security
>> > > centrally), while they want to delegate control of environments and
>> > > resources down to their teams. This is the need that I've heard from
>> many
>> > > users who have a "data platform team" that makes Airflow available to
>> > their
>> > > several teams. I think the proposal I have is a nice middle ground
>> that
>> > > follows Conway's law - that architecture of your system should reflect
>> > your
>> > > organizational structure - and what I separated out as "common" parts
>> is
>> > > precisely what "data platform team" would like to manage, where "team
>> > > environment" is something that data platform should (and want to)
>> > delegate
>> > > to their teams.
>> > >
>> > > 5) "I am a little surprised by a shared dataset" - Vikram/Elad : The
>> > > datasets are defined by their URLs and as such - they don't have
>> > > "ownership". As I see it - It's really important who can trigger a DAG
>> > and
>> > > the controls I proposed allow the DAG author to specify "In this DAG
>> it's
>> > > also ok when a different team (specified) triggered the dataset
>> event".
>> > But
>> > > I left a note that it is AIP-73-dependent "Expanded Data Awareness"
>> and
>> > > once we get that explained/clarified I am happy to coordinate with
>> > > Constance and see if we need to do more. Happy to hear more comments
>> on
>> > > that one.
>> > >
>> > > I reflected the 2 points and 5) in the AIP. Looking forward to more
>> > > comments on the proposal - in the AIP or here.
>> > >
>> > > J.
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > >
>> > > > Hello Everyone,
>> > > >
>> > > > I would like to resume the discussion on AIP-67. After going
>> > > > through a number of discussions and clarifications about the scope
>> > > > of Airflow 3, I rewrote the proposal for AIP-67 with the assumption
>> > > > that we will do it for Airflow 3 only - and that it will be based
>> > > > on the newly proposed AIP-72 (Task Execution Interface) rather than
>> > > > the Airflow 2-only AIP-44 Internal API.
>> > > >
>> > > > The updated proposal is here:
>> > > >
>> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>> > > >
>> > > > Feel free to comment there in-line or raise your "big" comments
>> > > > here, but here is the impact of changing the target to Airflow 3:
>> > > >
>> > > > 1) I proposed to change the configuration of Airflow to use more
>> > > > structured TOML rather than plain "ini" - TOML is a successor of
>> > > > "ini" and is largely compatible, but it has arrays, tables and
>> > > > nesting, has good support in Python, and is the "de-facto" standard
>> > > > for configuration now (pyproject.toml and the like). This was far
>> > > > too big a change for Airflow 2, but with Airflow 3 it seems very
>> > > > appropriate.
>> > > >
>> > > > 2) On a popular request I added "team_id" as a database field -
>> > > > this has quite a few far-reaching implications, and its ripple
>> > > > effect on Airflow 2 would be far too big for the "limited"
>> > > > multi-team setup - but since we are going to do full versioning
>> > > > including DB changes in Airflow 3, this is an opportunity to do it
>> > > > well. The implementation details will, however, depend on our
>> > > > choice of supported databases, so there is a small dependency on
>> > > > other decisions here. If we stick with both Postgres and MySQL, we
>> > > > will likely have to restructure the DB to have synthetic UUID
>> > > > identifiers in order to add both versioning and multi-team
>> > > > (because of MySQL index limitations).
>> > > >
>> > > > 3) The "proper" team identifier also allows us to expand the scope
>> > > > of multi-team to "per-team" connections and variables. Again, in
>> > > > the Airflow 2 case we could have limited it to the case where
>> > > > connections and variables come only from "per-team" secrets - but
>> > > > since we are going to have DB identifiers and we are going to -
>> > > > anyhow - reimplement the Connections and Variables UI to get rid
>> > > > of the FAB models and implement them in reactive technology, it's
>> > > > only a bit more complex to add "per-team" access there.
>> > > >
>> > > > 4) AIP-72, due to its "task" isolation, allows dropping the idea
>> > > > of the "--team" flag for the components. With AIP-72, routing
>> > > > tasks to particular "team" executors is enough and there is no
>> > > > need to pass the team information via a "--team" flag that was
>> > > > originally supposed to limit access of the components to only a
>> > > > single team. For Airflow 2 and AIP-44 that was a nice "hack" so
>> > > > that we did not have to carry the "authorization" information
>> > > > together with the task. But since part of AIP-72 is to carry
>> > > > verifiable metadata that will allow us to cryptographically verify
>> > > > task provenance, we can drop this hack and rely on the AIP-72
>> > > > implementation.
>> > > >
>> > > > 5) Since DB isolation is a "given" with AIP-72, we do not have to
>> > > > split the delivery of AIP-67 into two phases (with and without DB
>> > > > isolation) - it will be delivered as a single "with DB isolation"
>> > > > stage.
>> > > >
>> > > > Those are the major differences vs. the proposal from May (and, as
>> > > > you might see, it is quite a different scope - and this is really
>> > > > why I insisted on having the Airflow 2 / Airflow 3 discussion
>> > > > before we conclude the vote on it).
>> > > >
>> > > > I will go through the proposal on Thursday during our call as
>> > > > planned - but feel free to start discussing and commenting before.
>> > > >
>> > > > J.
>> > > >
>> > >
>> >
>>
>
