One more thing to add - if there are no more comments, I will start a vote soon - but maybe some clarifications might help - I spoke to a few other people about it:
* This one is heavily based on other AIPs that are also part of Airflow 3. While some parts are independent, AIP-72 (Task Isolation) and AIP-66 (Bundles and Parsing) might affect some of the details - and it may happen that those dependencies will **not** complete in time for Airflow 3.0. So I am happy to mark it as Airflow 3.0 (or 3.1, depending on dependencies).

* The AIP is not trying to address "huge" installations with multiple tenants. Those are better served by writing extra UI layers and more-or-less "hiding" Airflow behind a corporate "solution" - for those who can afford it. It's mostly for "mid-size" deployments, where there are enough separate (but interdependent) teams to manage, but where the users won't have the capacity to build their own "uber-airflow" on top, and where the Airflow UI might be useful to manage all the pipelines from those teams together. I might want to clarify this if that helps to convince those unconvinced.

J

On Tue, Jul 23, 2024 at 9:32 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I responded to some comments and had a long discussion on Slack with Ash, and I would love to hear even more comments and see if my responses are satisfactory (so I would love to get confirmation/further comments on those threads opened as discussion points - before I put it to a vote). There are 18 inline comments now that wait for "yeah, looks good" or "no, I still have doubts":
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>
> Things that I wanted to address as general "big points" and explain here - so that others can make their comments on them here as well.
>
> 1) (Mostly Ash) *Is it worth it? Should it be part of the community effort?* a) On one hand AIP-72 gives a lot of isolation, b) AIP-66 Bundles and Parsing **might** introduce separate execution environments. c) On the other hand, users can wrap multiple Airflow instances if the only "real" benefit is a single "UI/management" endpoint.
>
> My answer - I think it still is.
>
> a) AIP-72 does give isolation (same as AIP-44, only more optimized and designed from scratch), but it does not allow separation between teams for DAG file processors. AIP-67 effectively introduces it by allowing each team to have separate DAG file processors - this is possible even today - but it also allows them to effectively use a per-team environment, because via team_id-specific executors all the tasks of all the DAGs of the team might be executing in their dedicated environment (so, for example, the whole team might use the same container image for processors and all the workers they have in Celery). This is currently not part of AIP-72.
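To make the "per-team executor" idea in a) concrete, here is a minimal, self-contained sketch of how a scheduler could route a task to an executor based on the team_id assigned at parse time. This is not the AIP's actual implementation - names like TEAM_EXECUTORS and pick_executor, and the executor identifiers, are all made up for illustration:

    from dataclasses import dataclass

    @dataclass
    class TaskInstance:
        dag_id: str
        task_id: str
        # Assigned automatically when the team's DAG file processor parses the DAG.
        team_id: str | None

    # Hypothetical mapping: each team routes to its own executor (and therefore
    # its own container image / Celery queues); tasks without a team fall back
    # to the shared default executor.
    TEAM_EXECUTORS = {
        "team_a": "celery://team-a-queue",
        "team_b": "kubernetes://team-b-namespace",
    }
    DEFAULT_EXECUTOR = "celery://common-queue"

    def pick_executor(ti: TaskInstance) -> str:
        if ti.team_id is not None and ti.team_id in TEAM_EXECUTORS:
            return TEAM_EXECUTORS[ti.team_id]
        return DEFAULT_EXECUTOR

    print(pick_executor(TaskInstance("etl", "load", "team_a")))  # celery://team-a-queue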
> b) While AIP-66 initially had the concept of a per-bundle environment, I thought that having it defined by the DAG author - and the way it would be managed - was problematic. Especially if we allow "any" binary environment (whether a container image or a zip containing a venv), it has security issues (unreviewable dependencies) and performance issues for zipped venvs (installing such a bundle takes time, and environmental issues involving shared libraries and architectural differences are basically insurmountable - for example, that would make it very difficult to test the same DAGs across different architectures and even different versions of the same OS). Also, it was not clear who would manage those environments and how (i.e. the roles of the actors involved). While this is a great idea, I think it's too granular to be effectively managed "per team" - though it is great for some exceptional cases where we can say "this DAG is so special that it needs a separate environment". I strongly believe the "per team" environment is a much more common case (and it coincides with "working as a team" - where the whole team should be able to easily run and develop various DAGs). Conway's law in full swing. AIP-67 addresses all that - various environments per team, with the Team Deployment Manager introduced as the Actor who manages those environments and the team configuration.
>
> c) While yes, it's possible to wrap multiple Airflows with an extra UI layer and manage them in common - it either has to provide a "clone" of the Airflow UI where you can see several DAGs - possibly interacting with each other - or you have to have some awkward way of switching between, or showing, multiple Airflow UIs at the same time. The first is going to be super-costly to maintain over time as Airflow evolves; the second is really a band-aid IMHO. And when I think about it, I think mostly about on-premise users, not managed-services users. Yes, managed services could afford such extra layers, and it could be thought of as a nice "paid" feature of Airflow services. So maybe indeed we should think about it as "not in community"? But I know a number of users who have on-premise Airflow (and for good reasons cannot move to managed Airflow), and we would cut all of them off from having a number of their teams' pipelines isolated from an environment and security point of view, while centrally monitored and managed.
>
> 2) (Vikram, Constance and TP): *Permissions for Data Assets (related to AIP-73 + sub-AIPs)* - I proposed that each team's Data Assets are separated (same as for DAGs, team_id is assigned automatically when the DAGs are parsed by the DAG File Processor) - and that they are effectively "separate namespaces". We have no proposal for permissions of Data Assets (yet) - it might come later - that's why I do not want to interfere with it and proposed a simple solution where Data Assets can be matched by URIs (which become optional in AIP-73), and a DAG using a Data Asset with the same URI might get triggered by a Data Asset event from another team with the same URI only if the "triggered" DAG explicitly has `allowed_triggering_by = ["team_2", "team_3"]` (or similar).
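For concreteness, a minimal sketch of the matching rule described in 2). The class and field names here are illustrative only, not the actual schema; `allowed_triggering_by` is the proposal's tentative name, not a finalized API:

    from dataclasses import dataclass, field

    @dataclass
    class AssetEvent:
        uri: str
        team_id: str  # team of the DAG that emitted the event

    @dataclass
    class DagRecord:
        dag_id: str
        team_id: str
        schedule_uris: list[str]
        allowed_triggering_by: list[str] = field(default_factory=list)

    def may_trigger(event: AssetEvent, dag: DagRecord) -> bool:
        # Same URI required; cross-team events only count if the consuming
        # DAG explicitly opted in via allowed_triggering_by.
        if event.uri not in dag.schedule_uris:
            return False
        return event.team_id == dag.team_id or event.team_id in dag.allowed_triggering_by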
> 3) (Niko, Amogh, TP) - *Scope of "access" to team DAGs/folders/managing access* - I assume that from the DAG file processing POV, "folder(s)" is currently the right abstraction (and already handled by the standalone DAG File Processor). The nice thing about it is that control over folder "access" can (and should) be done outside of Airflow - we delegate it out. Similarly to Auth Manager access - we do not want to record which team members have access - we want to delegate it out. We do not want to keep information about who is in which team, nor handle team changes - this should all be handled in the corporate identity system. This way both sides are delegated out - DAG authoring and DAG management. Airflow only knows the "team_id", nothing else, once the DAG is parsed (DAGs and Data Assets created by parsing the DAG have team_id assigned automatically). And in the future we might expand it - when we have, for example, declarative DAGs submitted via API, the API can have "team_id" added, or when a DAG is created via some kind of factory, that factory might assign the ids. The "folder" that we have now simply makes use of the DAG file processor's --subdir flag - but actually it does not matter - once the DAG gets serialized, the DAG and its assets will just have "team_id" assigned, and they do not have to come from the same subdir.
>
> Looking forward to closing those discussions and putting it finally up to a vote.
>
> J.
>
> On Fri, Jul 19, 2024 at 10:49 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>>> 1. The roles and responsibilities of what you have called "Organization Deployment Managers" vs. "Team Deployment Managers". This is somewhat different than the model I have seen in practice, so I am trying to reconcile it in my head - whether this is a terminology difference or a role difference, and where exactly the line is drawn.
>>
>> This is what is not possible in current practice, because we have no such distinction. Currently there is only one deployment manager role, but at least from my talks with some users, and with the approach I proposed, it's possible that the "organization deployment manager" (a.k.a. the Data Platform Team) prepares and manages Airflow as a "whole" - i.e. running the scheduler, the webserver, and the connection to the organization's identity management system. But then those people struggle when different teams have different expectations for dependencies, OS, queues, GPU access etc., and the Data Platform "Organization Deployment Manager" wants to delegate that to each team's deployment manager - such a "team DM" should be able, for example, to manage their own K8S cluster where their team's jobs will be running - with appropriate dependencies/hardware resources etc. An important aspect is that each team can manage it without "bothering" the Platform team (but will still be trusted enough to be allowed to install arbitrary packages) - and the platform team will mostly worry about Airflow itself.
>>
>>> 2. DAG File Processing
>>> I am quite perplexed by this and wondering about the overlap between the "execution bundles" concept which Jed had defined as part of AIP-66 and since deleted from there.
>>
>> Actually, that was also the result of the discussion we had in the doc with Ash and Jed. And I see the potential of joining them. AIP-66 defined a bundle environment that could be specified - but the problem with that was that a) it assumed it is based on "pip install --target" venvs (which has a host of potential problems I explained), and b) it skipped over the management part of those - i.e. who prepares those environments and manages them - is this a DAG author? If so, this opens a host of security issues - because those environments are essentially "binary" and "not reviewable" - so there must be some role responsible for them in our security model (i.e... a "bundle deployment manager"?) - which sounds suspiciously similar to a "team deployment manager". One of the important differences of AIP-67 here is that it explicitly explains, and allows, separating all three "vulnerable" components per "team": DAG File Processor, Triggerer, Worker. AIP-66 does not mention that.
>> It implicitly assumes that running the code from different bundles can be done on shared machines (DAG File Processor/Scheduler) or even in the same process (Triggerer). AIP-67 adds explicit, strong isolation of code execution between the teams - so that the code from different teams is not executed on the same machines or containers, but can be easily separated (by deployment options). Without AIP-67 it is impossible - or very difficult and error-prone - to make sure that the code of one "bundle" cannot leak into the other "bundle". We seem to silently skip over the fact that both the DAG File Processor and the Triggerer can execute code on the same machine or in the same process, and that isolation in such a case is next to impossible. And this is a very important aspect of "enterprise security isolation" - in some ways even more important than isolating access to the same DB.
>>
>> And I am actually quite open to joining them - and having either "bundle = team" or "bundle belongs to team".
>>
>>> I will read this doc again for sure.
>>> It has grown and evolved so much that at least for me it was quite challenging to grasp. Thanks for working through this.
>>>
>>> Vikram
>>>
>>> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>
>>> > Nice, thanks for clarifying all this!
>>> >
>>> > Now that I read the new proposal, it is adding up to me why certain decisions were made. The decision to separate the "common" part from the "per team" part adds up now. It is a traditional paradigm of separating "control plane" from "compute".
>>> >
>>> > Thanks & Regards,
>>> > Amogh Desai
>>> >
>>> > On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>> >
>>> > > I got the transcript and chat from the last call (thanks Kaxil!) and it allowed me to answer a few questions that were asked during my presentation about AIP-67. I updated the AIP document, but here is a summary:
>>> > >
>>> > > 1) What about Pools? (asked by Elad, Jed and Jorrick): I thought about it and I propose that pools could have an (optional) team_id added. This will allow users to keep common pools (no team_id assigned) and have team-specific ones. The DAG file processor specific to each team will fail a DAG if it tries to use a pool that is not common and belongs to another team. Also, each team will be able to have their own "default_pool" configured. This will give enough flexibility in the "common vs. team-exclusive" use of pools.
>>> > >
>>> > > 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh, Ash): yes. That is part of the design. The connections and variables can be accessed per team - AIP-72 will only provide the tasks with connections that belong to the team. Ash mentioned OPA (which might be used for that purpose). It's not yet defined exactly how this will be implemented - AIP-72 is not detailed enough - but it can use the very mechanisms of AIP-72, by only allowing "global" connections and "my team" connections to be passed via the AIP-72 API to the task and the DAG file processor.
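A minimal, self-contained sketch of the "common vs. team-exclusive" rules described in 1) and 2) above - the class and function names are illustrative, not the actual AIP-67 schema:

    from dataclasses import dataclass

    @dataclass
    class Pool:
        name: str
        team_id: str | None = None  # None means a "common" pool, usable by every team

    @dataclass
    class Connection:
        conn_id: str
        team_id: str | None = None  # None means a "global" connection

    def check_pool(pool: Pool, team_id: str) -> None:
        # The per-team DAG file processor would fail the DAG at parse time
        # if it references another team's pool.
        if pool.team_id is not None and pool.team_id != team_id:
            raise ValueError(f"Pool {pool.name!r} belongs to team {pool.team_id!r}")

    def visible_connections(conns: list[Connection], team_id: str) -> list[Connection]:
        # The task-facing API would only hand a task the global connections
        # plus its own team's connections.
        return [c for c in conns if c.team_id in (None, team_id)]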
>>> > > 3) Whether "team = deployment"? (Igor / Vikram) -> It depends on what you understand by deployment. I'd say "sub-deployment" - each deployment in a "multi-team" environment will consist of the "common" part, and each team will have their own part (where configuration and management of such team deployment parts will be delegated to the team deployment manager). For example, such deployment managers will be able to build and publish the environment (for example, container images) used by team A to run Airflow, or change "team"-specific configuration.
>>> > >
>>> > > 4) "This seems like quite a lot of work to share a scheduler and a web server. What's the net benefit of this complexity?" (Ash, John, Amogh, Maciej): Yes, I absolutely see it as a valuable option. It reflects the organizational structure and needs of many of our users, who want to manage part of the environment - monitoring what's going on in all of their teams centrally (and managing things like Airflow upgrades and security centrally) - while delegating control of environments and resources down to their teams. This is the need I've heard from many users who have a "data platform team" that makes Airflow available to their several teams. I think my proposal is a nice middle ground that follows Conway's law - that the architecture of your system should reflect your organizational structure. What I separated out as the "common" part is precisely what a "data platform team" would like to manage, while the "team environment" is something that the data platform team should (and wants to) delegate to its teams.
>>> > >
>>> > > 5) "I am a little surprised by a shared dataset" (Vikram/Elad): The datasets are defined by their URIs and as such they don't have "ownership". As I see it, what really matters is who can trigger a DAG, and the controls I proposed allow the DAG author to specify "in this DAG it's also OK when a different team (specified) triggered the dataset event". But I left a note that it is dependent on AIP-73 "Expanded Data Awareness", and once we get that explained/clarified I am happy to coordinate with Constance and see if we need to do more. Happy to hear more comments on that one.
>>> > >
>>> > > I reflected the first 2 points and 5) in the AIP. Looking forward to more comments on the proposal - in the AIP or here.
>>> > >
>>> > > J.
>>> > >
>>> > > On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>> > >
>>> > > > Hello Everyone,
>>> > > >
>>> > > > I would like to resume the discussion on AIP-67. After going through a number of discussions and clarifications about the scope of Airflow 3, I rewrote the proposal for AIP-67 with the assumption that we will do it for Airflow 3 only - and that it will be based on the newly proposed AIP-72 (Task Execution Interface) rather than the Airflow 2-only AIP-44 Internal API.
>>> > > > The updated proposal is here:
>>> > > >
>>> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>> > > >
>>> > > > Feel free to comment there in-line or raise your "big" comments here, but here is the impact of changing the target to Airflow 3:
>>> > > >
>>> > > > 1) I proposed to change the configuration of Airflow to use the more structured TOML rather than plain "ini" - TOML is a successor of "ini" and is largely compatible, but it has arrays, tables and nesting, has good support in Python, and is the "de-facto" standard for configuration now (pyproject.toml and the like). This was far too big a change for Airflow 2, but with Airflow 3 it seems very appropriate.
>>> > > >
>>> > > > 2) On popular request I added "team_id" as a database field - this has quite a few far-reaching implications, and its ripple effect on Airflow 2 would be far too big for the "limited" multi-team setup - but since we are going to do full versioning, including DB changes, in Airflow 3, this is an opportunity to do it well. The implementation details will however depend on our choice of supported databases, so there is a little dependency on other decisions here. If we stick with both Postgres and MySQL, we will likely have to restructure the DB to have synthetic UUID identifiers in order to add both versioning and multi-team (because of MySQL index limitations).
>>> > > >
>>> > > > 3) The "proper" team identifier also allows expanding the scope of multi-team to also allow "per-team" connections and variables. Again, for the Airflow 2 case we could have limited it to the case where connections and variables come only from "per-team" secrets - but since we are going to have DB identifiers, and we are going to - anyhow - reimplement the Connections and Variables UI to get rid of the FAB models and implement them in reactive technology, it's only a bit more complex to add "per-team" access there.
>>> > > >
>>> > > > 4) AIP-72, due to its "task" isolation, allows dropping the idea of the "--team" flag for the components. With AIP-72, routing tasks to particular "team" executors is enough, and there is no need to pass the team information via the "--team" flag that was originally supposed to limit access of the components to only a single team. For Airflow 2 and AIP-44 that was a nice "hack" so that we did not have to carry the "authorization" information together with the task. But since part of AIP-72 is carrying verifiable metadata that will allow us to cryptographically verify task provenance, we can drop this hack and rely on the AIP-72 implementation.
>>> > > >
>>> > > > 5) Since DB isolation is a "given" with AIP-72, we do not have to split the delivery of AIP-67 into two phases (with and without DB isolation) - it will be delivered as a single "with DB isolation" stage.
>>> > > >
>>> > > > Those are the major differences vs. the proposal from May (and as you might see, it is quite a different scope - and this is really why I insisted on having the Airflow 2 / Airflow 3 discussion before we conclude the vote on it).
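As an illustration of point 1) - a hypothetical fragment of what a TOML-based, per-team configuration could look like, parsed here with Python's stdlib tomllib. The section and key names are entirely made up, not the proposed schema; the point is that TOML's arrays-of-tables can express per-team sections that plain "ini" cannot:

    import tomllib  # stdlib since Python 3.11

    config = tomllib.loads("""
    [core]
    executor = "CeleryExecutor"

    [[teams]]
    team_id = "team_a"
    dags_subdir = "/opt/airflow/dags/team_a"
    default_pool = "team_a_default"

    [[teams]]
    team_id = "team_b"
    dags_subdir = "/opt/airflow/dags/team_b"
    default_pool = "team_b_default"
    """)

    for team in config["teams"]:
        print(team["team_id"], "->", team["dags_subdir"])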
>>> > > > I will go through the proposal on Thursday during our call as planned - but feel free to start discussions and comments before.
>>> > > >
>>> > > > J.