One more thing to add - if there are no more comments, I will start a vote soon - but maybe some clarifications might help - I spoke to a few other people about it:
* This one is heavily based on other AIPs that are also part of Airflow 3. While some parts are independent, AIP-72 (Task Isolation) and AIP-66 (Bundles and Parsing) might affect some of the details - and it may happen that those dependencies will **not** complete in time for Airflow 3.0. So I am happy to mark it as Airflow 3.0 (or 3.1, depending on dependencies).

* The AIP is not trying to address "huge" installations with multiple tenants. Those are better served by writing extra UI layers and more-or-less "hiding" Airflow behind a corporate "solution" - for those who can afford it. It's mostly for "mid-size" deployments, where there are enough separate (but interdependent) teams to manage, but where the users won't have the capacity to build their own "uber-airflow" on top, and where the Airflow UI might be useful to manage all the pipelines from those teams together. I might want to clarify this if that helps to convince those unconvinced.

J

On Tue, Jul 23, 2024 at 9:32 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I responded to some comments and had a long discussion on Slack with Ash, and I would love to hear even more comments and see if my responses are satisfactory (so I would love to get confirmation/further comments on those threads opened as discussion points - before I put it to a vote). There are 18 inline comments now that wait for "yeah, looks good" or "no, I still have doubts":
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>
> Things that I wanted to address as general "big points" and explain here - so that others can make their comments on them here as well.
>
> 1) (Mostly Ash) *Is it worth it? Should it be part of the community effort?* a) On one hand AIP-72 gives a lot of isolation, b) AIP-66 Bundles and Parsing **might** introduce separate execution environments. c) On the other hand, users can wrap multiple Airflow instances if the only "real" benefit is a single "UI/management" endpoint.
>
> My answer - I think it still is.
>
> a) AIP-72 does give isolation (same as AIP-44, only more optimized and designed from scratch), but it does not allow separation between teams for DAG file processors. AIP-67 effectively introduces it by allowing each team to have separate DAG file processors - this is possible even today - but it also allows them to effectively use a per-team environment, because via team_id-specific executors all the tasks of all the DAGs of the team might be executing in their dedicated environment (so, for example, the whole team might use the same container image for processors and all the workers they have in Celery). This is currently not part of AIP-72.
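To make the "per-team executor" idea in a) concrete, here is a minimal, self-contained sketch of how a scheduler could route a task to an executor based on the team_id assigned at parse time. This is not the AIP's actual implementation - names like TEAM_EXECUTORS and pick_executor, and the executor identifiers, are all made up for illustration:

    from dataclasses import dataclass

    @dataclass
    class TaskInstance:
        dag_id: str
        task_id: str
        # Assigned automatically when the team's DAG file processor parses the DAG.
        team_id: str | None

    # Hypothetical mapping: each team routes to its own executor (and therefore
    # its own container image / Celery queues); tasks without a team fall back
    # to the shared default executor.
    TEAM_EXECUTORS = {
        "team_a": "celery://team-a-queue",
        "team_b": "kubernetes://team-b-namespace",
    }
    DEFAULT_EXECUTOR = "celery://common-queue"

    def pick_executor(ti: TaskInstance) -> str:
        if ti.team_id is not None and ti.team_id in TEAM_EXECUTORS:
            return TEAM_EXECUTORS[ti.team_id]
        return DEFAULT_EXECUTOR

    print(pick_executor(TaskInstance("etl", "load", "team_a")))  # celery://team-a-queue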
> b) While AIP-66 initially had the concept of a per-bundle environment, I thought that having it defined by the DAG author - and the way it would be managed - was problematic. Especially if we allow "any" binary environment (whether a container image or a zip containing a venv), it has security issues (unreviewable dependencies) and performance issues for zipped venvs (installing such a bundle takes time, and environmental issues involving shared libraries and architectural differences are basically insurmountable - for example, that would make it very difficult to test the same DAGs across different architectures and even different versions of the same OS). Also, it was not clear who would manage those environments and how (i.e. the roles of the actors involved). While this is a great idea, I think it's too granular to be effectively managed "per team" - though it is great for some exceptional cases where we can say "this DAG is so special that it needs a separate environment". I strongly believe the "per team" environment is a much more common case (and it coincides with "working as a team" - where the whole team should be able to easily run and develop various DAGs). Conway's law in full swing. AIP-67 addresses all that - various environments per team, with the Team Deployment Manager introduced as the Actor who manages those environments and the team configuration.
>
> c) While yes, it's possible to wrap multiple Airflows with an extra UI layer and manage them in common - it either has to provide a "clone" of the Airflow UI where you can see several DAGs - possibly interacting with each other - or you have to have some awkward way of switching between, or showing, multiple Airflow UIs at the same time. The first is going to be super-costly to maintain over time as Airflow evolves; the second is really a band-aid IMHO. And when I think about it, I think mostly about on-premise users, not managed-services users. Yes, managed services could afford such extra layers, and it could be thought of as a nice "paid" feature of Airflow services. So maybe indeed we should think about it as "not in community"? But I know a number of users who have on-premise Airflow (and for good reasons cannot move to managed Airflow), and we would cut all of them off from having a number of their teams' pipelines isolated from an environment and security point of view, while centrally monitored and managed.
>
> 2) (Vikram, Constance and TP): *Permissions for Data Assets (related to AIP-73 + sub-AIPs)* - I proposed that each team's Data Assets are separated (same as for DAGs, team_id is assigned automatically when the DAGs are parsed by the DAG File Processor) - and that they are effectively "separate namespaces". We have no proposal for permissions of Data Assets (yet) - it might come later - that's why I do not want to interfere with it and proposed a simple solution where Data Assets can be matched by URIs (which become optional in AIP-73), and a DAG using a Data Asset with the same URI might get triggered by a Data Asset event from another team with the same URI only if the "triggered" DAG explicitly has `allowed_triggering_by = ["team_2", "team_3"]` (or similar).
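For concreteness, a minimal sketch of the matching rule described in 2). The class and field names here are illustrative only, not the actual schema; `allowed_triggering_by` is the proposal's tentative name, not a finalized API:

    from dataclasses import dataclass, field

    @dataclass
    class AssetEvent:
        uri: str
        team_id: str  # team of the DAG that emitted the event

    @dataclass
    class DagRecord:
        dag_id: str
        team_id: str
        schedule_uris: list[str]
        allowed_triggering_by: list[str] = field(default_factory=list)

    def may_trigger(event: AssetEvent, dag: DagRecord) -> bool:
        # Same URI required; cross-team events only count if the consuming
        # DAG explicitly opted in via allowed_triggering_by.
        if event.uri not in dag.schedule_uris:
            return False
        return event.team_id == dag.team_id or event.team_id in dag.allowed_triggering_by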
> 3) (Niko, Amogh, TP) - *Scope of "access" to team DAGs/folders/managing access* - I assume that from the DAG file processing POV, "folder(s)" is currently the right abstraction (and already handled by the standalone DAG File Processor). The nice thing about it is that control over folder "access" can (and should) be done outside of Airflow - we delegate it out. Similarly to Auth Manager access - we do not want to record which team members have access - we want to delegate it out. We do not want to keep information about who is in which team, nor handle team changes - this should all be handled in the corporate identity system. This way both sides are delegated out - DAG authoring and DAG management. Airflow only knows the "team_id", nothing else, once the DAG is parsed (DAGs and Data Assets created by parsing the DAG have team_id assigned automatically). And in the future we might expand it - when we have, for example, declarative DAGs submitted via API, the API can have "team_id" added, or when a DAG is created via some kind of factory, that factory might assign the ids. The "folder" that we have now simply makes use of the DAG file processor's --subdir flag - but actually it does not matter - once the DAG gets serialized, the DAG and its assets will just have "team_id" assigned, and they do not have to come from the same subdir.
>
> Looking forward to closing those discussions and putting it finally up to a vote.
>
> J.
>
> On Fri, Jul 19, 2024 at 10:49 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>>> 1. The roles and responsibilities of what you have called "Organization Deployment Managers" vs. "Team Deployment Managers". This is somewhat different than the model I have seen in practice, so I am trying to reconcile it in my head - whether this is a terminology difference or a role difference, and where exactly the line is drawn.
>>
>> This is what is not possible in current practice, because we have no such distinction. Currently there is only one deployment manager role, but at least from my talks with some users, and with the approach I proposed, it's possible that the "organization deployment manager" (a.k.a. the Data Platform Team) prepares and manages Airflow as a "whole" - i.e. running the scheduler, the webserver, and the connection to the organization's identity management system. But then those people struggle when different teams have different expectations for dependencies, OS, queues, GPU access etc., and the Data Platform "Organization Deployment Manager" wants to delegate that to each team's deployment manager - such a "team DM" should be able, for example, to manage their own K8S cluster where their team's jobs will be running - with appropriate dependencies/hardware resources etc. An important aspect is that each team can manage it without "bothering" the Platform team (but will still be trusted enough to be allowed to install arbitrary packages) - and the platform team will mostly worry about Airflow itself.
>>
>>> 2. DAG File Processing
>>> I am quite perplexed by this and wondering about the overlap between the "execution bundles" concept which Jed had defined as part of AIP-66 and since deleted from there.
>>
>> Actually, that was also the result of the discussion we had in the doc with Ash and Jed. And I see the potential of joining them. AIP-66 defined a bundle environment that could be specified - but the problem with that was that a) it assumed it is based on "pip install --target" venvs (which has a host of potential problems I explained), and b) it skipped over the management part of those - i.e. who prepares those environments and manages them - is this a DAG author? If so, this opens a host of security issues - because those environments are essentially "binary" and "not reviewable" - so there must be some role responsible for them in our security model (i.e... a "bundle deployment manager"?) - which sounds suspiciously similar to a "team deployment manager". One of the important differences of AIP-67 here is that it explicitly explains, and allows, separating all three "vulnerable" components per "team": DAG File Processor, Triggerer, Worker. AIP-66 does not mention that.
>> It implicitly assumes that running the code from different bundles can be done on shared machines (DAG File Processor/Scheduler) or even in the same process (Triggerer). AIP-67 adds explicit, strong isolation of code execution between the teams - so that the code from different teams is not executed on the same machines or containers, but can be easily separated (by deployment options). Without AIP-67 it is impossible - or very difficult and error-prone - to make sure that the code of one "bundle" cannot leak into the other "bundle". We seem to silently skip over the fact that both the DAG File Processor and the Triggerer can execute code on the same machine or in the same process, and that isolation in such a case is next to impossible. And this is a very important aspect of "enterprise security isolation" - in some ways even more important than isolating access to the same DB.
>>
>> And I am actually quite open to joining them - and having either "bundle = team" or "bundle belongs to team".
>>
>>> I will read this doc again for sure.
>>> It has grown and evolved so much that at least for me it was quite challenging to grasp. Thanks for working through this.
>>>
>>> Vikram
>>>
>>> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>
>>> > Nice, thanks for clarifying all this!
>>> >
>>> > Now that I read the new proposal, it is adding up to me why certain decisions were made. The decision to separate the "common" part from the "per team" part adds up now. It is a traditional paradigm of separating "control plane" from "compute".
>>> >
>>> > Thanks & Regards,
>>> > Amogh Desai
>>> >
>>> > On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>> >
>>> > > I got the transcript and chat from the last call (thanks Kaxil!) and it allowed me to answer a few questions that were asked during my presentation about AIP-67. I updated the AIP document, but here is a summary:
>>> > >
>>> > > 1) What about Pools? (asked by Elad, Jed and Jorrick): I thought about it and I propose that pools could have an (optional) team_id added. This will allow users to keep common pools (no team_id assigned) and have team-specific ones. The DAG file processor specific to each team will fail a DAG if it tries to use a pool that is not common and belongs to another team. Also, each team will be able to have their own "default_pool" configured. This will give enough flexibility in the "common vs. team-exclusive" use of pools.
>>> > >
>>> > > 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh, Ash): yes. That is part of the design. The connections and variables can be accessed per team - AIP-72 will only provide the tasks with connections that belong to the team. Ash mentioned OPA (which might be used for that purpose). It's not yet defined exactly how this will be implemented - AIP-72 is not detailed enough - but it can use the very mechanisms of AIP-72, by only allowing "global" connections and "my team" connections to be passed via the AIP-72 API to the task and the DAG file processor.
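A minimal, self-contained sketch of the "common vs. team-exclusive" rules described in 1) and 2) above - the class and function names are illustrative, not the actual AIP-67 schema:

    from dataclasses import dataclass

    @dataclass
    class Pool:
        name: str
        team_id: str | None = None  # None means a "common" pool, usable by every team

    @dataclass
    class Connection:
        conn_id: str
        team_id: str | None = None  # None means a "global" connection

    def check_pool(pool: Pool, team_id: str) -> None:
        # The per-team DAG file processor would fail the DAG at parse time
        # if it references another team's pool.
        if pool.team_id is not None and pool.team_id != team_id:
            raise ValueError(f"Pool {pool.name!r} belongs to team {pool.team_id!r}")

    def visible_connections(conns: list[Connection], team_id: str) -> list[Connection]:
        # The task-facing API would only hand a task the global connections
        # plus its own team's connections.
        return [c for c in conns if c.team_id in (None, team_id)]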
>>> > > 3) Whether "team = deployment"? (Igor / Vikram) -> It depends on what you understand by deployment. I'd say "sub-deployment" - each deployment in a "multi-team" environment will consist of the "common" part, and each team will have their own part (where configuration and management of such team deployment parts will be delegated to the team deployment manager). For example, such deployment managers will be able to build and publish the environment (for example, container images) used by team A to run Airflow, or change "team"-specific configuration.
>>> > >
>>> > > 4) "This seems like quite a lot of work to share a scheduler and a web server. What's the net benefit of this complexity?" (Ash, John, Amogh, Maciej): Yes, I absolutely see it as a valuable option. It reflects the organizational structure and needs of many of our users, who want to manage part of the environment - monitoring what's going on in all of their teams centrally (and managing things like Airflow upgrades and security centrally) - while delegating control of environments and resources down to their teams. This is the need I've heard from many users who have a "data platform team" that makes Airflow available to their several teams. I think my proposal is a nice middle ground that follows Conway's law - that the architecture of your system should reflect your organizational structure. What I separated out as the "common" part is precisely what a "data platform team" would like to manage, while the "team environment" is something that the data platform team should (and wants to) delegate to its teams.
>>> > >
>>> > > 5) "I am a little surprised by a shared dataset" (Vikram/Elad): The datasets are defined by their URIs and as such they don't have "ownership". As I see it, what really matters is who can trigger a DAG, and the controls I proposed allow the DAG author to specify "in this DAG it's also OK when a different team (specified) triggered the dataset event". But I left a note that it is dependent on AIP-73 "Expanded Data Awareness", and once we get that explained/clarified I am happy to coordinate with Constance and see if we need to do more. Happy to hear more comments on that one.
>>> > >
>>> > > I reflected the first 2 points and 5) in the AIP. Looking forward to more comments on the proposal - in the AIP or here.
>>> > >
>>> > > J.
>>> > >
>>> > > On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>> > >
>>> > > > Hello Everyone,
>>> > > >
>>> > > > I would like to resume the discussion on AIP-67. After going through a number of discussions and clarifications about the scope of Airflow 3, I rewrote the proposal for AIP-67 with the assumption that we will do it for Airflow 3 only - and that it will be based on the newly proposed AIP-72 (Task Execution Interface) rather than the Airflow 2-only AIP-44 Internal API.
>>> > > > The updated proposal is here:
>>> > > >
>>> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>> > > >
>>> > > > Feel free to comment there in-line or raise your "big" comments here, but here is the impact of changing the target to Airflow 3:
>>> > > >
>>> > > > 1) I proposed to change the configuration of Airflow to use the more structured TOML rather than plain "ini" - TOML is a successor of "ini" and is largely compatible, but it has arrays, tables and nesting, has good support in Python, and is the "de-facto" standard for configuration now (pyproject.toml and the like). This was far too big a change for Airflow 2, but with Airflow 3 it seems very appropriate.
>>> > > >
>>> > > > 2) On popular request I added "team_id" as a database field - this has quite a few far-reaching implications, and its ripple effect on Airflow 2 would be far too big for the "limited" multi-team setup - but since we are going to do full versioning, including DB changes, in Airflow 3, this is an opportunity to do it well. The implementation details will however depend on our choice of supported databases, so there is a little dependency on other decisions here. If we stick with both Postgres and MySQL, we will likely have to restructure the DB to have synthetic UUID identifiers in order to add both versioning and multi-team (because of MySQL index limitations).
>>> > > >
>>> > > > 3) The "proper" team identifier also allows expanding the scope of multi-team to also allow "per-team" connections and variables. Again, for the Airflow 2 case we could have limited it to the case where connections and variables come only from "per-team" secrets - but since we are going to have DB identifiers, and we are going to - anyhow - reimplement the Connections and Variables UI to get rid of the FAB models and implement them in reactive technology, it's only a bit more complex to add "per-team" access there.
>>> > > >
>>> > > > 4) AIP-72, due to its "task" isolation, allows dropping the idea of the "--team" flag for the components. With AIP-72, routing tasks to particular "team" executors is enough, and there is no need to pass the team information via the "--team" flag that was originally supposed to limit access of the components to only a single team. For Airflow 2 and AIP-44 that was a nice "hack" so that we did not have to carry the "authorization" information together with the task. But since part of AIP-72 is carrying verifiable metadata that will allow us to cryptographically verify task provenance, we can drop this hack and rely on the AIP-72 implementation.
>>> > > >
>>> > > > 5) Since DB isolation is a "given" with AIP-72, we do not have to split the delivery of AIP-67 into two phases (with and without DB isolation) - it will be delivered as a single "with DB isolation" stage.
>>> > > >
>>> > > > Those are the major differences vs. the proposal from May (and as you might see, it is quite a different scope - and this is really why I insisted on having the Airflow 2 / Airflow 3 discussion before we conclude the vote on it).
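As an illustration of point 1) - a hypothetical fragment of what a TOML-based, per-team configuration could look like, parsed here with Python's stdlib tomllib. The section and key names are entirely made up, not the proposed schema; the point is that TOML's arrays-of-tables can express per-team sections that plain "ini" cannot:

    import tomllib  # stdlib since Python 3.11

    config = tomllib.loads("""
    [core]
    executor = "CeleryExecutor"

    [[teams]]
    team_id = "team_a"
    dags_subdir = "/opt/airflow/dags/team_a"
    default_pool = "team_a_default"

    [[teams]]
    team_id = "team_b"
    dags_subdir = "/opt/airflow/dags/team_b"
    default_pool = "team_b_default"
    """)

    for team in config["teams"]:
        print(team["team_id"], "->", team["dags_subdir"])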
>>> > > > I will go through the proposal on Thursday during our call as planned - but feel free to start discussions and comments before.
>>> > > >
>>> > > > J.