Thanks Jarek, both the comments you added above make sense and help me understand the steps here.
I was definitely struggling with the dependencies of this AIP on AIP-72 (Task Isolation) and AIP-66 (DAG Bundles). Specifically, there were a couple of references that were key in my mind:

- Workload isolation, which I do think is very clearly understood in AIP-72.
- DAG file processor changes, which I hadn't quite thought through in the context of multi-team, and which may touch both AIP-66 and AIP-72.

I think I understand your concept of a team very clearly, and the concept of "team configuration". I did find two of the "per-team configuration principles" somewhat in conflict, though:

1. *Each team configuration SHOULD be a separate configuration file or separate set of environment variables, holding team-specific configuration needed by the DAG file processor, Workers and Triggerer.*
2. In a multi-team deployment, Connections and Variables have a (nullable) team_id field, which makes them either belong to a specific team or be "globally available".

Personally, I would have thought you would want to standardize on either explicit configuration required or implicit allowed throughout, rather than mixing the two. But you probably have reasons why a mixed approach is suitable here.

I also agree with your comment on scheduling fairness for now. I assume that the per-team DAG parsing environment is where you see the need to integrate with AIP-66 (DAG Bundles ...) - is that correct?
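To check my reading of principle 1, here is roughly what I picture a per-team configuration file looking like, and how a per-team component could load it. Everything below - the file name, sections, keys and values - is purely my own illustration (assuming the TOML format proposed later in this thread), not something the AIP defines:

# Hypothetical per-team configuration file, e.g. teams/team_a.toml.
# All keys here are illustrative guesses, not part of AIP-67.
# Requires Python 3.11+ for the standard-library TOML parser.
import tomllib

TEAM_A_CONFIG = """
[team]
id = "team_a"

[dag_processor]
subdir = "/opt/airflow/dags/team_a"

[workers]
queue = "team_a_queue"
container_image = "registry.example.com/team-a/airflow-worker:latest"

[triggerer]
capacity = 500
"""

# Per principle 1, only team_a's DAG file processor, workers and triggerer
# would read this file (or an equivalent set of environment variables).
config = tomllib.loads(TEAM_A_CONFIG)
print(config["team"]["id"], config["workers"]["queue"])

If I read it right, principle 2 (the nullable team_id on Connections and Variables) then lives in the shared metadata DB rather than in these per-team files - which is where the explicit-vs-implicit mix I mentioned comes from.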
On Thu, Jul 25, 2024 at 5:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> One more thing to add - if there will be no more comments, I will start a vote soon - but maybe some clarifications might help; I spoke to a few other people about it:
>
> * This one is heavily based on other AIPs that are also part of Airflow 3. While some parts are independent, AIP-72 (Task Isolation) and AIP-66 (Parsing and Bundling) - and some details of those - might have an impact on some of the details here, and maybe those dependencies will **not** complete in time for Airflow 3.0. So I am happy to mark it as Airflow 3.0 (or 3.1, depending on dependencies).
> * The AIP is not trying to address "huge" installations with multiple tenants. Those are better served by writing extra UI layers and more-or-less "hiding" Airflow behind the corporate "solution", for those who can afford it. It's mostly for "mid-size" deployments, where there are enough separate (but interdependent) teams to manage, but where the users won't have the capacity to build their own "uber-Airflow" on top, and where the Airflow UI might be useful to manage all the pipelines from those teams together. I might want to clarify that if it helps to convince those unconvinced.
>
> J
>
> On Tue, Jul 23, 2024 at 9:32 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I responded to some comments and had a long discussion on Slack with Ash, and I would love to hear even more comments and see if my responses are satisfactory (so I would love to get confirmation/further comments on those threads opened as discussion points - before I put it to a vote). There are 18 inline comments now that wait for "yeah, looks good" or "no, I still have doubts":
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>
>> Things that I wanted to address as general "big points" and explain here - so that others can make their comments on them here as well.
>>
>> 1) (Mostly Ash) *Is it worth it? Should it be part of the community effort?* a) On one hand, AIP-72 gives a lot of isolation. b) AIP-66 Bundles and Parsing **might** introduce separate execution environments. c) On the other hand, users can wrap multiple Airflow instances if the only "real" benefit is a single "UI/management" endpoint.
>>
>> My answer - I think it still is.
>>
>> a) AIP-72 does give isolation (same as AIP-44, only more optimized and designed from scratch), but it does not allow separation between teams for DAG file processors. AIP-67 effectively introduces it by allowing each team to have separate DAG file processors - this is possible even today - but it also allows each team to effectively use a per-team environment, because via team_id-specific executors all the tasks of all the DAGs of the team might be executing in their dedicated environment (so, for example, the whole team might use the same container image for processors and all the workers they have in Celery). This is currently not part of AIP-72.
>>
>> b) While AIP-66 initially had the concept of a per-bundle environment, I thought that having it defined by the DAG author, and the way it would have to be managed, is problematic. Especially if we allow "any" binary environment (whether it's a container image or a zip containing a venv), it has security issues (unreviewable dependencies) and performance issues for a zipped venv (installing such a bundle takes time, and environmental issues involving shared libraries and architectural differences are basically insurmountable - for example, that would make it very difficult to test the same DAGs across different architectures and even different versions of the same OS). Also, it was not clear who would manage those environments and how (i.e. the roles of the actors involved). While this is a great idea, I think it's too granular to be effectively managed "per team" - though it is great for some exceptional cases where we can say "this DAG is so special that it needs a separate environment". I strongly believe the "per team" environment is a much more common case (and it coincides with "working as a team", where the whole team should be able to easily run and develop various DAGs). Conway's law in full swing. AIP-67 addresses all of that - various environments per team, with the Team Deployment Manager introduced as the actor who manages those environments and the team configuration.
>>
>> c) Yes, it's possible to wrap multiple Airflows with an extra UI layer and have them managed in common - but that layer either has to provide a "clone" of the Airflow UI where you can see several DAGs, possibly interacting with each other, or you have to have some awkward way of switching between, or showing, multiple Airflow UIs at the same time. The first is going to be super-costly to maintain over time as Airflow evolves; the second is really a band-aid IMHO. And when I think about it, I think mostly about on-premise users, not managed-services users. Yes, managed services could afford such extra layers, and it could be thought of as a nice "paid" feature of Airflow services. So maybe indeed we should think about it as "not in community"?
>> But I know a number of users who have on-premise Airflow (and for good reasons cannot move to managed Airflow), and we would cut all of them off from being able to have a number of their teams' pipelines isolated from an environment and security point of view, while still centrally monitored and managed.
>>
>> 2) (Vikram, Constance and TP): *Permissions for Data Assets (related to AIP-73 + sub-AIPs)* - I proposed that each team's Data Assets are separated (same as for DAGs, team_id is assigned automatically when the DAG File Processor parses the DAGs) and that they are effectively "separate namespaces". We have no proposal for permissions of Data Assets (yet) - it might come later - that's why I do not want to interfere with it and proposed a simple solution where Data Assets can be matched by URIs (which become optional in AIP-73), and a DAG using a Data Asset with a given URI might get triggered by a Data Asset with the same URI from another team only if the "triggered" DAG explicitly has `allowed_triggering_by = ["team_2", "team_3"]` (or similar).
>>
>> 3) (Niko, Amogh, TP) - *Scope of "access" to team DAGs/folders/managing access* - I assume that from the DAG file processing point of view, "folder(s)" is currently the right abstraction (and already handled by the standalone DAG file processor). The nice thing about it is that control over "access" to the folder can (and should) be done outside of Airflow - we delegate it out. Similarly to Auth Manager access - we do not want to record which team members have access; we want to delegate that out. We do not want to keep information about who is in which team, nor handle team changes - this should all be handled in the corporate identity system. This way both sides are delegated out - DAG authoring and DAG management. Airflow only knows the "team_id", and nothing else, once the DAG is parsed (the DAG and the Data Assets created by parsing the DAG have team_id assigned automatically). In the future we might expand it - when we have, for example, declarative DAGs submitted via API, the API can have "team_id" added, or when a DAG is created via some kind of factory, that factory might assign the ids. The "folder" that we have now is simply there to make use of the DAG file processor's --subdir flag - but actually it does not matter: once the DAG gets serialized, the DAG and its assets will just have "team_id" assigned, and they do not have to come from the same subdir.
>>
>> Looking forward to closing those discussions and putting it finally up to a vote.
>>
>> J.
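For my own understanding of 2) above, this is how I picture the cross-team triggering rule in plain Python. None of this is an Airflow API - the names, the AssetEvent/ConsumerDag shapes and the allowed_triggering_by attribute (which the proposal itself marks as "or similar") are just my illustration of the rule, and the treatment of "global" producers is my assumption:

from dataclasses import dataclass, field

@dataclass
class AssetEvent:
    uri: str
    producer_team_id: str | None  # None = assumed to mean a "globally available" producer

@dataclass
class ConsumerDag:
    dag_id: str
    team_id: str
    watched_uris: set[str]
    allowed_triggering_by: set[str] = field(default_factory=set)

def event_triggers_dag(event: AssetEvent, dag: ConsumerDag) -> bool:
    """May an asset event with a matching URI trigger this DAG?"""
    if event.uri not in dag.watched_uris:
        return False
    # Within the same team (or from a global producer) triggering just works.
    if event.producer_team_id in (None, dag.team_id):
        return True
    # Across teams it only works when the DAG author explicitly opted in.
    return event.producer_team_id in dag.allowed_triggering_by

# team_1's DAG opts in to events produced by team_2 on the same URI:
dag = ConsumerDag("daily_report", "team_1", {"s3://bucket/data.csv"},
                  allowed_triggering_by={"team_2"})
assert event_triggers_dag(AssetEvent("s3://bucket/data.csv", "team_2"), dag)
assert not event_triggers_dag(AssetEvent("s3://bucket/data.csv", "team_3"), dag)

Is that the intended semantics?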
>> On Fri, Jul 19, 2024 at 10:49 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>>> 1. The roles and responsibilities of what you have called "Organization Deployment Managers" vs. "Team Deployment Managers". This is somewhat different than the model I have seen in practice, so I am trying to reconcile this in my head - whether this is a terminology difference or a role difference, and where exactly the line is drawn.
>>>
>>> This is what is not possible in current practice, because we have no such distinction. Currently there is only one Deployment Manager role, but - at least from my talks with some users, and with the approach I proposed - it's possible that the "Organization Deployment Manager" (a.k.a. the Data Platform team) prepares and manages Airflow as a "whole", i.e. running the Airflow scheduler, the webserver, and the connection to the organisation's identity management system. But then those people struggle when different teams have different expectations for dependencies, OS, queues, GPU access etc., and the Data Platform "Organization Deployment Manager" wants to delegate that to each team's deployment manager. Such a "team DM" should be able, for example, to manage their own K8S cluster where their team's jobs will be running - with the appropriate dependencies/hardware resources etc. An important aspect is that each team can manage that without "bothering" the platform team (but will still be trusted enough to be allowed to install arbitrary packages), and the platform team will mostly worry about Airflow itself.
>>>
>>>> 2. DAG File Processing
>>>> I am quite perplexed by this and wondering about the overlap with the "execution bundles" concept, which Jed had defined as part of AIP-66 and has since deleted from there.
>>>
>>> Actually, that was also the result of the discussion we had in the doc with Ash and Jed - and I see the potential of joining the two. AIP-66 defined a bundle environment that could be specified, but the problem with that was that a) it assumed it is based on "pip install --target" venvs (which has a host of potential problems I explained), and b) it skipped over the management part of those environments - i.e. who prepares and manages them. Is it the DAG author? If so, this opens a host of security issues, because those environments are essentially "binary" and "not reviewable" - so there must be some role responsible for them in our security model (a "bundle deployment manager"?), which sounds suspiciously similar to a "team deployment manager". One of the important differences of AIP-67 here is that it explicitly explains and allows separating all three "vulnerable" components per team - DAG file processor, Triggerer, Worker. AIP-66 does not mention that, implicitly assuming that code from different bundles can run on shared machines (DAG file processor/Scheduler) or even in the same process (Triggerer). AIP-67 adds explicit, strong isolation of code execution between the teams - so that code from different teams is not executed on the same machines or containers, and can be easily separated (by deployment options). Without AIP-67 it is impossible - or very difficult and error-prone - to make sure that the code of one "bundle" cannot leak into the other "bundle". We seem to silently skip over the fact that both the DAG file processor and the Triggerer can execute code on the same machine or in the same process, and that isolation in such a case is next to impossible. And this is a very important aspect of "enterprise security isolation" - in some ways even more important than isolating access to the same DB.
>>>
>>> And I actually am quite open to joining them - and having either "bundle = team" or "bundle belongs to team".
>>>
>>>> I will read this doc again for sure. It has grown and evolved so much that, at least for me, it was quite challenging to grasp. Thanks for working through this.
>>>> Vikram
>>>>
>>>> On Thu, Jul 18, 2024 at 9:39 PM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>>>
>>>>> Nice, thanks for clarifying all this!
>>>>>
>>>>> Now that I read the new proposal, it is adding up for me why certain decisions were made. The decision to separate the "common" part from the "per team" part adds up now - it is the traditional paradigm of separating the "control plane" from "compute".
>>>>>
>>>>> Thanks & Regards,
>>>>> Amogh Desai
>>>>>
>>>>> On Mon, Jul 15, 2024 at 8:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> I got the transcript and chat from the last call (thanks Kaxil!) and it allowed me to answer a few questions that were asked during my presentation about AIP-67. I updated the AIP document, but here is a summary:
>>>>>>
>>>>>> 1) What about Pools (asked by Elad, Jed and Jorrick): I thought about it and I propose that pools could have an (optional) team_id added. This will allow users to keep common pools (no team_id assigned) and have team-specific ones. The DAG file processor specific to each team will fail a DAG if it tries to use a pool that is neither common nor belongs to that team. Also, each team will be able to have their own "default_pool" configured. This gives enough flexibility on "common vs. team-exclusive" use of pools.
>>>>>>
>>>>>> 2) Isolation for connections (John, Filip, Elad, Kaxil, Amogh, Ash): yes, that is part of the design. The connections and variables can be accessed per team - AIP-72 will only provide the tasks with connections that belong to the team. Ash mentioned OPA (which might be used for that purpose). It is not defined exactly how it will be implemented in AIP-72 - it's not detailed enough yet - but it can use the very mechanisms AIP-72 provides, by only allowing "global" connections and "my team" connections to be passed via the AIP-72 API to the task and the DAG file processor.
>>>>>>
>>>>>> 3) Whether "team = deployment" (Igor / Vikram?) -> it depends on what you understand by deployment. I'd say "sub-deployment": each deployment in a "multi-team" environment will consist of the "common" part, and each team will have their own part (where configuration and management of such team deployment parts will be delegated to the team deployment manager). For example, such deployment managers will be able to build and publish the environment (for example, container images) used by team A to run Airflow, or change "team"-specific configuration.
>>>>>>
>>>>>> 4) "This seems like quite a lot of work to share a scheduler and a web server. What's the net benefit of this complexity?" (Ash, John, Amogh, Maciej): Yes, I absolutely see it as a valuable option.
>>>>>> It reflects the organizational structure and needs of many of our users, who want to manage part of the environment - monitoring what is going on in all of their teams centrally (and managing things like Airflow upgrades and security centrally) - while delegating control of environments and resources down to their teams. This is the need I have heard from many users who have a "data platform team" that makes Airflow available to several of their teams. I think the proposal I have is a nice middle ground that follows Conway's law - that the architecture of your system should reflect your organizational structure. What I separated out as the "common" parts is precisely what a "data platform team" would like to manage, whereas the "team environment" is something the data platform team should (and wants to) delegate to their teams.
>>>>>>
>>>>>> 5) "I am a little surprised by a shared dataset" (Vikram/Elad): The datasets are defined by their URIs and, as such, they don't have "ownership". As I see it, what really matters is who can trigger a DAG, and the controls I proposed allow the DAG author to specify "in this DAG it's also OK when a different (specified) team triggered the dataset event". But I left a note that this is dependent on AIP-73 "Expanded Data Awareness", and once we get that explained/clarified I am happy to coordinate with Constance and see if we need to do more. Happy to hear more comments on that one.
>>>>>>
>>>>>> I reflected the 2 points and 5) in the AIP. Looking forward to more comments on the proposal - in the AIP or here.
>>>>>>
>>>>>> J.
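To make sure I read points 1) and 2) of the Jul 15 message above correctly, here is the rule I think is being described, in plain Python. The names and shapes are only my illustration (and I treat connections the same way as pools), not an actual Airflow API:

from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    team_id: str | None = None  # None -> a "common" pool, usable by every team

def usable_by_team(pool: Pool, dag_team_id: str) -> bool:
    """A team's DAG may use common pools and its own team's pools, nothing else."""
    return pool.team_id is None or pool.team_id == dag_team_id

pools = {
    "default_pool": Pool("default_pool"),            # common
    "gpu_pool": Pool("gpu_pool", team_id="team_a"),  # team_a only
}

# The per-team DAG file processor would fail a team_b DAG referencing gpu_pool:
assert usable_by_team(pools["default_pool"], "team_b")
assert not usable_by_team(pools["gpu_pool"], "team_b")

If that matches the intent, then I assume the same predicate sits behind the AIP-72 API deciding which connections and variables a task may see ("global" plus "my team").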
>>>>>> On Tue, Jul 9, 2024 at 4:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I would like to resume the discussion on AIP-67. After going through a number of discussions and clarifications about the scope of Airflow 3, I rewrote the proposal for AIP-67 with the assumption that we will do it for Airflow 3 only - and that it will be based on the newly proposed AIP-72 (Task Execution Interface) rather than the Airflow 2-only AIP-44 Internal API.
>>>>>>>
>>>>>>> The updated proposal is here:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components
>>>>>>>
>>>>>>> Feel free to comment there in-line or raise your "big" comments here, but here is the impact of changing the target to Airflow 3:
>>>>>>>
>>>>>>> 1) I proposed to change the configuration of Airflow to use the more structured TOML rather than plain "ini" - TOML is a successor of "ini" and is largely compatible, but it has arrays, tables and nesting, has good support in Python, and is the de-facto standard for configuration now (pyproject.toml and the like). This was far too big a change for Airflow 2, but with Airflow 3 it seems very appropriate.
>>>>>>>
>>>>>>> 2) On popular request I added "team_id" as a database field - this has quite a few far-reaching implications, and its ripple effect on Airflow 2 would be far too big for the "limited" multi-team setup - but since we are going to do full versioning, including DB changes, in Airflow 3, this is an opportunity to do it well. The implementation detail will, however, depend on our choice of supported databases, so there is a small dependency on other decisions here. If we stick with both Postgres and MySQL, we will likely have to restructure the DB to have synthetic UUID identifiers in order to add both versioning and multi-team (because of MySQL index limitations).
>>>>>>>
>>>>>>> 3) The "proper" team identifier also allows expanding the scope of multi-team to also allow "per-team" connections and variables. Again, for the Airflow 2 case we could have limited it to connections and variables coming only from "per-team" secrets - but since we are going to have DB identifiers, and we are going to reimplement the Connections and Variables UI anyhow (to get rid of the FAB models and implement them in reactive technology), it's only a bit more complex to add "per-team" access there.
>>>>>>>
>>>>>>> 4) AIP-72, due to its "task" isolation, allows dropping the idea of the "--team" flag on the components. With AIP-72, routing tasks to particular "team" executors is enough, and there is no need to pass the team information via a "--team" flag that was originally supposed to limit access of the components to only a single team. For Airflow 2 and AIP-44 that was a nice "hack" so that we did not have to carry the "authorization" information together with the task. But since part of AIP-72 is to carry verifiable metadata that will allow us to cryptographically verify task provenance, we can drop this hack and rely on the AIP-72 implementation.
>>>>>>>
>>>>>>> 5) Since DB isolation is a "given" with AIP-72, we do not have to split the delivery of AIP-67 into two phases (with and without DB isolation) - it will be delivered as a single "with DB isolation" stage.
>>>>>>>
>>>>>>> Those are the major differences vs. the proposal from May (and, as you might see, it is quite a different scope - and this is really why I insisted on having the Airflow 2 / Airflow 3 discussion before we conclude the vote on it).
>>>>>>>
>>>>>>> I will go through the proposal on Thursday during our call as planned - but feel free to start discussions and comments before.
>>>>>>>
>>>>>>> J.
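One last thing that would help me picture 2) and 3) of the Jul 9 message: is the sketch below roughly the shape of the change? It is only a toy stand-in (assuming SQLAlchemy 2.x), not airflow.models.Connection and not the schema the AIP will actually ship:

import uuid

from sqlalchemy import Column, String, Uuid, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Connection(Base):
    """Toy model only - illustrates a synthetic UUID key plus a nullable team_id."""

    __tablename__ = "connection"

    # Synthetic UUID primary key instead of a wide natural key
    # (mentioned as needed because of MySQL index limitations).
    id = Column(Uuid, primary_key=True, default=uuid.uuid4)
    conn_id = Column(String(250), nullable=False)
    # NULL -> "globally available"; otherwise owned by a specific team.
    team_id = Column(String(63), nullable=True)

# Creates the table against an in-memory SQLite just to show the columns are valid.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)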