Thanks Ash and Jarek for the detailed comments. I largely agree with the points you both raised, hence I updated the AIP and added sections on Authentication, API Versioning, and Packaging as well. Please go through it once more and let me know if there are more things to consider before I open it for voting.
On Thu, Aug 7, 2025 at 6:03 PM Jarek Potiuk <ja...@potiuk.com> wrote: > Also > > > 1. The Authentication token. How will this long lived token work without > being insecure. Who and what will generate it? How will we identify > top-level requests for Variables in order to be able to add Variable > RBAC/ACLs. This is an important enough thing that I think it needs > discussion before we vote on this AIP. > > We are currently discussing - in the security team - the approach for JWT token > handling, so likely we could move the discussion there, it does have some > security implications and I think we should bring our findings to the > devlist when we complete it, but I think we should add this case there. > IMHO we should have a different approach for UI, different for Tasks, > different for Triggerer, and different for DagProcessor (possibly the > Triggerer and DagProcessor could be the same because they share essentially > the same long-living token). Ash - I will add this to the discussion there. > > J. > > > > On Thu, Aug 7, 2025 at 2:23 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > > Ah.. So if we are talking about a more complete approach - seeing those > > comments from Ash - makes me think we should have another AIP > > (connected) about splitting the distributions. We have never finalized it > > (nor even discussed it) but Ash - you had some initial document for that. > > So maybe we should finalize it and rather than specify it in this AIP - > > have a separate AIP about distribution split that AIP-92 could depend on. > > It seems much more reasonable to split "distribution and code split" from > > parsing isolation I think and implement them separately/in parallel. > > > > Reading Ash's comments (and maybe I am going a bit further than Ash) it > > calls for something that I am a big proponent of - splitting "airflow-core" > > and having different "scheduler", "webserver", "dag processor" and > > "triggerer" distributions. 
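Jarek's point that each component (UI, Tasks, Triggerer, DagProcessor) may need a different token approach can be sketched as audience-scoped, expiring tokens. Everything below (the claim names, the HMAC scheme, the helper names) is a hypothetical illustration, not the design the security team is actually discussing:

```python
# Sketch only: component-scoped, expiring tokens, so a DagProcessor token
# cannot be replayed against Triggerer endpoints. Hypothetical, stdlib-only
# stand-in for a real JWT library such as PyJWT.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret"  # in practice: per-deployment key material


def issue_token(component: str, ttl_seconds: int) -> str:
    """Mint a token whose audience ("aud") names the component it is for."""
    payload = {"aud": component, "exp": time.time() + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def validate_token(token: str, expected_component: str) -> bool:
    """Check signature, audience, and expiry before honoring a request."""
    body, sig = token.rsplit(".", 1)
    expected_sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected_sig):
        return False
    payload = json.loads(base64.urlsafe_b64decode(body))
    return payload["aud"] == expected_component and payload["exp"] > time.time()


token = issue_token("dag-processor", ttl_seconds=3600)
assert validate_token(token, "dag-processor")
assert not validate_token(token, "triggerer")  # audience mismatch is rejected
```

A short TTL plus a refresh path would mitigate the "long-lived token" concern Ash raises; the long-living DagProcessor/Triggerer case is exactly the part still open for discussion.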
Now - we have the capability of having "shared" code - we do not need "common" code to make it happen - because we can share code. > > > > What it could give us - on top of a clean client/server split, we could have > > different dependencies used by those distributions. Additionally, we could > > also split off the executors from providers and finally implement it in the > > way that the scheduler does not use providers at all (not even cncf.kubernetes > > nor celery providers installed in the scheduler nor webserver, but "executors" > > packages instead). The code sharing approach with symlinks we have now will > > make it a .... breeze :) . That would also imply sharing "connection" > > definitions through the DB, and likely implementing the "test connection" > > feature properly finally (i.e. executing test connection in the worker / > > triggerer rather than in the web server, which is the reason why we disabled it by > > default now). This way the "api-server" would not need any of the providers to > > be installed either, which IMHO is the biggest win from a security point of > > view. > > > > And the nice thing about it is that it would be rather transparent when > > anyone uses "pip install apache-airflow" - it would behave exactly the > > same, no more complexity involved, simply more distributions installed when > > the "apache-airflow" meta-distribution is used, but it would allow those who > > want to implement a more complex and secure setup to have different > > "environments" with modularized pieces of airflow installed - only > > "apache-airflow-dag-processor + task-sdk + providers" where the dag-processor > > is run, only "apache-airflow-scheduler + executors" where the scheduler is > > installed, only "apache-airflow-task-sdk + providers" where workers are > > running, only "apache-airflow-api-server" where the api-server is running, and > > only "apache-airflow-triggerer + task-sdk + providers" where the triggerer is running. 
> > > > I am happy (Ash, if you are fine with that) to take that original document > > over and lead this part and the new AIP to completion (including > > implementation), I am very much convinced that this will lead to much > > better dependency security and more modular code without impacting the > > "apache-airflow" installation complexity. > > > > If we do it this way - the part of code/clean split would be "delegated > > out" from AIP-92 to this new AIP and turned into a dependency. > > > > J. > > > > > > On Thu, Aug 7, 2025 at 1:51 PM Ash Berlin-Taylor <a...@apache.org> wrote: > > > >> This AIP is definitely heading in the right direction and is a feature > >> I’d like to see. > >> > >> For me the outstanding things that need more detail: > >> > >> 1. The Authentication token. How will this long lived token work without > >> being insecure. Who and what will generate it? How will we identify > >> top-level requests for Variables in order to be able to add Variable > >> RBAC/ACLs. This is an important enough thing that I think it needs > >> discussion before we vote on this AIP. > >> 2. Security generally — how will this work, especially with the > >> multi-team? I think this likely means making the APIs work on the bundle > >> level as you mention in the doc, but I haven’t thought deeply about this > >> yet. > >> 3. API Versioning? One of the key driving goals with AIP-72 and the > >> Task Execution SDK was the idea that “you can upgrade the API server as you > >> like, and your clients/workers never need to be upgraded” — i.e. the API server is > >> 100% working with all older versions of the TaskSDK. I don’t know if we > >> will achieve that goal in the long run but it is the desire, and part of > >> why we are using CalVer and the Cadwyn library to provide API versioning. > >> 4. 
As mentioned previously, not sure the existing serialised JSON format > >> for DAGs is correct, but since that now has a version and we already have the > >> ability to upgrade that somewhere in the Airflow Core, that doesn’t > >> necessarily become a blocker/pre-requisite for this AIP. > >> > >> I think the Dag parsing API client+submission+parsing process manager should > >> either live in the Task SDK dist, or in a new separate dist that uses > >> TaskSDK, but crucially not in apache-airflow-core. My reason for this is > >> that I want it to be possible for the server components (scheduler, API > >> server) to not need task-sdk installed (just for cleanliness/avoiding > >> confusion about what versions it needs) and also vice versa, to be able to > >> run a “team worker bundle” (Dag parsing, workers, triggerer/async workers) > >> on whatever version of TaskSDK they choose, again without > >> apache-airflow-core installed, for avoidance of doubt. > >> > >> Generally I would like this as it means we can have a nicer separation of > >> Core and Dag parsing code; as the dag parsing itself uses the SDK, it would > >> be nice to have a proper server/client split, both from a tighter security > >> point-of-view, but also from a code layout point of view. > >> > >> -ash > >> > >> > >> > On 7 Aug 2025, at 12:36, Jarek Potiuk <ja...@potiuk.com> wrote: > >> > > >> > Well, you started it - so it's up to you to decide if you think we have > >> > consensus, or whether we need a vote. > >> > > >> > And it's not a question of an "informal" vote; it's rather clear > >> following > >> > https://www.apache.org/foundation/voting.html that we either need a > >> > LAZY CONSENSUS or VOTE thread. Both are formal. > >> > > >> > This is the difficult part when you have a proposal: to assess (by you) > >> > whether we are converging to consensus or whether a vote is needed. There is > >> > no other body or "authority" to do it for you. > >> > > >> > J. 
> >> > > >> > On Thu, Aug 7, 2025 at 1:02 PM Sumit Maheshwari < sumeet.ma...@gmail.com > >> > > >> > wrote: > >> > > >> >> Sorry for nudging again, but can we come to some consensus on this? I > >> mean > >> >> if this AIP isn't good enough, then we can drop it altogether and > >> someone > >> >> can rethink the whole thing. Should we do some kind of informal voting > >> and > >> >> close this thread? > >> >> > >> >> On Mon, Aug 4, 2025 at 3:32 PM Jarek Potiuk <ja...@potiuk.com> > wrote: > >> >> > >> >>>>> My main concern with this right now is the serialisation format of > >> >> DAGs > >> >>> — > >> >>>> it wasn’t really designed with remote submission in mind, so it needs > >> >> some > >> >>>> careful examination to see if it is fit for this purpose or not. > >> >>> > >> >>> I understand Ash's concerns - the format has not been designed with > >> >>> size/speed optimization in mind so **possibly** we could design a > >> >> different > >> >>> format that would be better suited. > >> >>> > >> >>> BUT ... Done is better than perfect. > >> >>> > >> >>> I think there are a number of risks involved in changing the format > >> and > >> >> it > >> >>> could significantly increase development time with uncertain gains > >> at > >> >>> the end - also because of the progress in compression that happened > >> over > >> >>> the last few years. > >> >>> > >> >>> It might be a good idea to experiment a bit with different compression > >> >>> algorithms for "our" dag representation and possibly we could find the > >> >> best > >> >>> algorithm for the "airflow dag" type of json data. There are a lot of > >> >>> repetitions in the JSON representation and I guess in "our" json > >> >>> representation there are some artifacts and repeated sections that > >> simply > >> >>> might compress well with different algorithms. Also in this case > >> >>> speed matters (and the CPU trade-off). 
> >> >>> > >> >>> Looking at compression "theory" - before we experiment with it - > >> there is > >> >>> the relatively new standard "zstandard" > >> https://github.com/facebook/zstd > >> >>> compression, opensourced in 2016, which I've heard good things about - > >> >>> especially that it maintains a very good compression rate for text > >> data, > >> >>> but also it is tremendously fast - especially for decompression > >> (which is > >> >>> a super important factor for us - we compress a new DAG representation far > >> >> less > >> >>> often than we decompress it in the general case). It is standardized in RFC > >> >>> https://datatracker.ietf.org/doc/html/rfc8878 and there are various > >> >>> implementations and it is even being added to the Python standard library > >> in > >> >>> Python 3.14 > >> https://docs.python.org/3.14/library/compression.zstd.html > >> >> and > >> >>> there is a very well maintained python binding library > >> >>> https://pypi.org/project/zstd/ to Yann Collet's (the algorithm author) > >> ZSTD C > >> >>> library. And libzstd is already part of our images - it is needed by > >> >> other > >> >>> dependencies of ours. All with BSD licence, directly usable by us. > >> >>> > >> >>> I think this one might be a good candidate for us to try, and possibly > >> >> with > >> >>> zstd we could achieve both size and CPU overhead that would be > >> comparable > >> >>> with any "new" format we could come up with - especially as we are > >> >>> talking merely about processing a huge blob between a "storable" > >> >> (compressed) > >> >>> and a "locally usable" state (Python dict). 
We could likely use a > >> streaming > >>> JSON library (say the one that is used in Pydantic internally, > >>> https://github.com/pydantic/jiter - we already have it as part of > >>> Pydantic) > >>> to also save memory - we could stream the decompressed stream into jiter > so > >>> that the json dict and string representation do not have to be > >>> loaded fully in memory at the same time. There are likely lots of > >>> optimisations we could do - I mentioned possibly streaming the data > from the > >>> API directly to the DB (if this is possible - not sure) > >>> > >>> J. > >>> > >>> > >>> On Mon, Aug 4, 2025 at 9:10 AM Sumit Maheshwari < > >> sumeet.ma...@gmail.com> > >>> wrote: > >>> > >>>>> > >>>>> My main concern with this right now is the serialisation format of > >>> DAGs — > >>>>> it wasn’t really designed with remote submission in mind, so it needs > >>> some > >>>>> careful examination to see if it is fit for this purpose or not. > >>>>> > >>>> > >>>> I'm not sure on this point, cause if we are able to convert a DAG > >> into > >>>> JSON, then it has to be transferable over the internet. > >>>> > >>>> In particular, one of the things I worry about is that the JSON can > >> get > >>> huge > >>>>> — I’ve seen this as large as 10-20MB for some dags > >>>> > >>>> > >>>> Yeah, agree on this, that's why we can transfer compressed data > >> instead > >> of > >>>> real json. Of course, this won't guarantee that the payload will > >> always > >>> be > >>>> small enough, but we can't say that it'll definitely happen either. 
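The compression experiment Jarek proposes above can be sketched in a few lines: build a deliberately repetitive serialized-DAG-style JSON blob and measure how well it compresses. The DAG structure here is fabricated for illustration, and stdlib zlib stands in for zstd (the thread's actual candidate, available via `compression.zstd` from Python 3.14 or the PyPI bindings):

```python
# Rough sketch of the proposed compression experiment, using zlib as a
# stand-in for zstd. The "serialized DAG" below is a fabricated example of
# the kind of repetitive JSON the thread describes.
import json
import zlib

dag = {
    "dag_id": "example",
    "tasks": [
        # strings where bools/ints would do, as Ash suspects in real DAGs
        {"task_id": f"task_{i}", "operator": "BashOperator",
         "retries": "0", "depends_on_past": "false"}
        for i in range(2000)
    ],
}
raw = json.dumps(dag).encode()
compressed = zlib.compress(raw, level=6)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
assert len(compressed) < len(raw) // 5  # repetitive JSON compresses heavily
```

A real experiment would swap in zstd at several compression levels and also time decompression, since the thread notes DAGs are decompressed far more often than compressed.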
> >>>> > >>>> I also wonder if as part of this proposal we should move the Callback > >>>>> requests off the dag parsers and on to the workers instead > >>>> > >>>> let's make such a "workload" implementation stream that could > >> support > >>> both > >>>>> - Deadlines and DAG parsing logic > >>>> > >>>> > >>>> I don't have any strong opinion here, but it feels like it's gonna > >> blow > >>> up > >>>> the scope of the AIP too much. > >>>> > >>>> > >>>> On Fri, Aug 1, 2025 at 2:27 AM Jarek Potiuk <ja...@potiuk.com> > >> wrote: > >>>> > >>>>>> My main concern with this right now is the serialisation format of > >>>> DAGs — > >>>>> it wasn’t really designed with remote submission in mind, so it needs > >>> some > >>>>> careful examination to see if it is fit for this purpose or not. > >>>>> > >>>>> Yep. That might potentially be a problem (or at least "need more > >>>> resources > >>>>> to run airflow") and that is where my "2x memory" came from if we do > >> it > >>>> in > >>>>> a trivial way. Currently we a) keep the whole DAG in memory when > >>>>> serializing it, b) submit it to the database (also using essentially some > >>> kind > >>>>> of API - implemented by the database client) - so we know the whole > >>> thing > >>>>> "might work", but indeed if you use a trivial implementation of > >>> submitting > >>>>> the whole json - it basically means that the whole json will have to > >>> also > >>>>> be kept in the memory of the API server. But we also compress it when > >>> needed > >>>> - > >>>>> I wonder what compression ratios we saw with those 10-20MB > >>> Dags > >>>> - > >>>>> if the problem is using strings where a bool would suffice, > >> compression > >>>>> should generally help a lot. 
We could only ever send compressed data > >>> over > >>>>> the API - there seems to be no need to send "plain JSON" data over > >> the > >>>> API > >>>>> or to store the plain JSON in the DB (of course that trades memory > >> for > >>>> CPU). > >>>>> > >>>>> I wonder if sqlalchemy 2 (and the drivers for MySQL/Postgres) have > >> support > >>>> for > >>>>> any kind of binary data streaming - because that could help a lot if > >>>> we > >>>>> could use a streaming HTTP API and chunk and append the binary chunks > >>> (when > >>>>> writing) - or read data in chunks and stream them back via the API. > >>> That > >>>>> could seriously decrease the amount of memory needed by the API > >> server > >>> to > >>>>> process such huge serialized dags. > >>>>> > >>>>> And yeah - I would also love the "execute task" to be implemented > >> here > >>> - > >>>>> but I am not sure if this should be part of the same effort or maybe > >> a > >>>>> separate implementation? That sounds very loosely coupled with DB > >>>>> isolation. And it seems a common theme - I think that would also > >> cover > >>> the > >>>>> sync Deadline alerts case that we discussed at the dev call today. I > >>>> wonder > >>>>> if that should not be kind of parallel (let's make such a "workload" > >>>>> implementation stream that could support both - Deadlines and DAG > >>> parsing > >>>>> logic). We have already two "users" for it and I really love the > >> saying > >>>> "if > >>>>> you want to make something reusable - make it usable first" - seems > >>> like > >>>>> we might have a good opportunity to make such a workload implementation > >>>> "doubly > >>>>> used" from the beginning, which would increase the chances it will be > >>>>> "reusable" for other things as well :). > >>>>> > >>>>> J. 
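The chunked-streaming idea above can be sketched without the HTTP or sqlalchemy plumbing: compress a large serialized-DAG blob incrementally and reassemble it chunk by chunk, so neither side ever holds the full plain JSON plus its compressed form in one buffer. zlib again stands in for zstd, and the function names are illustrative only:

```python
# Sketch of streaming a compressed serialized-DAG blob in chunks. zlib is a
# stand-in for zstd; the transport (streaming HTTP API, DB writes) is out of
# scope here.
import json
import zlib


def compressed_chunks(data: bytes, chunk_size: int = 64 * 1024):
    """Yield compressed chunks of `data`, e.g. for a chunked HTTP upload."""
    comp = zlib.compressobj()
    for offset in range(0, len(data), chunk_size):
        if out := comp.compress(data[offset:offset + chunk_size]):
            yield out
    yield comp.flush()


def reassemble(chunks) -> bytes:
    """Receiving side: decompress chunks incrementally as they arrive."""
    decomp = zlib.decompressobj()
    return b"".join(decomp.decompress(c) for c in chunks) + decomp.flush()


blob = json.dumps({"tasks": [{"task_id": f"t{i}"} for i in range(10_000)]}).encode()
assert reassemble(compressed_chunks(blob)) == blob
```

In a real implementation the receiving side would append each decompressed (or still-compressed) chunk to the DB rather than joining in memory, which is exactly the part that depends on driver support for binary streaming.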
> >>>>> > >>>>> > >>>>> On Thu, Jul 31, 2025 at 12:28 PM Ash Berlin-Taylor < a...@apache.org> > >>>> wrote: > >>>>> > >>>>>> My main concern with this right now is the serialisation format of > >>>> DAGs — > >>>>>> it wasn’t really designed with remote submission in mind, so it needs > >>>> some > >>>>>> careful examination to see if it is fit for this purpose or not. > >>>>>> > >>>>>> In particular, one of the things I worry about is that the JSON can > >>> get > >>>>>> huge — I’ve seen this as large as 10-20MB for some dags(!!) (which > >> is > >>>>>> likely due to things being included as text when a bool might > >>> suffice, > >>>>> for > >>>>>> example) But I don’t think “just submit the existing JSON over an > >>> API” > >>>>> is a > >>>>>> good idea. > >>>>>> > >>>>>> I also wonder if as part of this proposal we should move the > >> Callback > >>>>>> requests off the dag parsers and on to the workers instead — in > >>> AIP-72 > >>>> we > >>>>>> introduced the concept of a Workload, with the only one existing > >>> right > >>>>> now > >>>>>> being “ExecuteTask” > >>>>>> > >>>>> > >>>> > >>> > >> > https://github.com/apache/airflow/blob/8e1201c7713d5c677fa6f6d48bbd4f6903505f61/airflow-core/src/airflow/executors/workloads.py#L87-L88 > >>>>>> — it might be time to finally move task and dag callbacks to the > >> same > >>>>> thing > >>>>>> and make dag parsers only responsible for, well, parsing. :) > >>>>>> > >>>>>> These are all solvable problems, and this will be a great feature > >> to > >>>>> have, > >>>>>> but we need to do some more thinking and planning first. > >>>>>> > >>>>>> -ash > >>>>>> > >>>>>>> On 31 Jul 2025, at 10:12, Sumit Maheshwari < > >> sumeet.ma...@gmail.com > >>>> > >>>>>> wrote: > >>>>>>> > >>>>>>> Gentle reminder for everyone to review the proposal. 
> >>>>>>> > >>>>>>> Updated link: > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-92+Isolate+DAG+processor%2C+Callback+processor%2C+and+Triggerer+from+core+services > >>>>>>> > >>>>>>> On Tue, Jul 29, 2025 at 4:37 PM Sumit Maheshwari < > >>>>> sumeet.ma...@gmail.com > >>>>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Thanks everyone for reviewing this AIP. As Jarek and others > >>>>> suggested, I > >>>>>>>> expanded the scope of this AIP and divided it into three phases. > >>>> With > >>>>>> the > >>>>>>>> increased scope, the boundary line between this AIP and AIP-85 > >>> got a > >>>>>> little > >>>>>>>> thinner, but I believe these are still two different > >> enhancements > >>> to > >>>>>> make. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Fri, Jul 25, 2025 at 10:51 PM Sumit Maheshwari < > >>>>>> sumeet.ma...@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Yeah, overall it makes sense to include Triggers as well to be > >>> part > >>>>> of > >>>>>>>>> this AIP and phase the implementation. Though I didn't > >>> exclude > >>>>>> Triggers > >>>>>>>>> because "Uber" doesn't need that, I just thought of keeping the > >>>> scope > >>>>>> of > >>>>>>>>> development small and achievable, just like it was done in > >>>>> Airflow > >>>>>> 3 by > >>>>>>>>> isolating only workers and not the DAG-processor & Triggers. > >>>>>>>>> > >>>>>>>>> But if you think Triggers should be part of this AIP itself, > >>> then I > >>>>> can > >>>>>>>>> do that and include Triggers as well in it. 
> >>>>>>>>> > >>>>>>>>> On Fri, Jul 25, 2025 at 7:34 PM Jarek Potiuk < ja...@potiuk.com > >>> > >>>>> wrote: > >>>>>>>>> > >>>>>>>>>> I would very much prefer the architectural choices of this AIP > >>> are > >>>>>> based > >>>>>>>>>> on > >>>>>>>>>> "general public" needs rather than "Uber needs", even if Uber > >>> will > >>>> be > >>>>>>>>>> implementing it - so from my point of view having Trigger > >>>> separation > >>>>>> as > >>>>>>>>>> part of it is quite important. > >>>>>>>>>> > >>>>>>>>>> But that's not even this. > >>>>>>>>>> > >>>>>>>>>> We've been discussing for example for Deadlines (being > >>> implemented > >>>>> by > >>>>>>>>>> Dennis and Ramit) a possibility of short, notification-style > >>>>>> "deadlines" > >>>>>>>>>> to be sent to the triggerer for execution - this is well advanced > >>> now, > >>>>> and > >>>>>>>>>> whether you want it or not Dag-provided code might be > >> serialized > >>>> and > >>>>>> sent > >>>>>>>>>> to the triggerer for execution. This is part of our "broader" > >>>>>> architectural > >>>>>>>>>> change where we treat "workers" and "triggerer" similarly - as general > >>>>> executors > >>>>>>>>>> of "sync" and "async" tasks respectively. That's > >> where > >>>>>> Airflow > >>>>>>>>>> is > >>>>>>>>>> evolving towards - inevitably. > >>>>>>>>>> > >>>>>>>>>> But we can of course phase things in/out for implementation - > >>> even > >>>>> if > >>>>>> the AIP > >>>>>>>>>> should cover both, I think if the goal of the AIP and preamble > >>> is > >>>>>> about > >>>>>>>>>> separating "user code" from "database" as the main reason, it > >>> also > >>>>>> means > >>>>>>>>>> the Triggerer if you ask me (from a design point of view at least). 
> >> >>>>>>>>>> > >> >>>>>>>>>> Again implementation can be phased and even different people > >> >> and > >> >>>>> teams > >> >>>>>>>>>> might work on those phases/pieces. > >> >>>>>>>>>> > >> >>>>>>>>>> J. > >> >>>>>>>>>> > >> >>>>>>>>>> On Fri, Jul 25, 2025 at 2:29 PM Sumit Maheshwari < > >> >>>>>> sumeet.ma...@gmail.com > >> >>>>>>>>>>> > >> >>>>>>>>>> wrote: > >> >>>>>>>>>> > >> >>>>>>>>>>>> > >> >>>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers > as > >> >>>> well, > >> >>>>>>>>>> but > >> >>>>>>>>>>>> that > >> >>>>>>>>>>>> can be done as part of a different AIP > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> You won't achieve your goal of "true" isolation of user code > >> >> if > >> >>>> you > >> >>>>>>>>>> don't > >> >>>>>>>>>>>> do triggerer. I think if the goal is to achieve it - it > >> >> should > >> >>>>> cover > >> >>>>>>>>>>> both. > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> My bad, should've explained our architecture for triggers as > >> >>>> well, > >> >>>>>>>>>>> apologies. 
So here it is: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> - Triggers would be running on a centralized service, so > >> all > >>>> the > >>>>>>>>>> Trigger > >>>>>>>>>>> classes will be part of the platform team's repo and not > >> the > >>>>>>>>>> customer's > >>>>>>>>>>> repo > >>>>>>>>>>> - The triggers won't be able to use any libs other than std > >>>> ones, > >>>>>>>>>> which > >>>>>>>>>>> are being used in core Airflow (like requests, etc) > >>>>>>>>>>> - As we are the owners of the core Airflow repo, customers > >>> have > >>>>> to > >>>>>>>>>> get > >>>>>>>>>>> our approval to land any class in this path (unlike the > >> dags > >>>> repo > >>>>>>>>>> which > >>>>>>>>>>> they own) > >>>>>>>>>>> - When a customer's task defers, we would have an allowlist > >> on > >>>> our > >>>>>>>>>> side > >>>>>>>>>>> to check if we should do the async polling or not > >>>>>>>>>>> - If the Trigger class isn't part of our repo (allowlist), > >>> just > >>>>>>>>>> fail the > >>>>>>>>>>> task, as anyway we won't be having the code that they used > >> in > >>>> the > >>>>>>>>>>> trigger > >>>>>>>>>>> class > >>>>>>>>>>> - If any of these conditions aren't suitable for you (as a > >>>>>>>>>> customer), > >>>>>>>>>>> feel free to use sync tasks only > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> But in general, I agree to make the triggerer svc also > >> communicate > >>>> over > >>>>>>>>>> APIs > >>>>>>>>>>> only. If that is done, then we can have instances of the > >> triggerer > >>>> svc > >>>>>>>>>> running > >>>>>>>>>>> on the customer's side as well, which can process any type of > >>> trigger > >>>>>>>>>> class. > >>>>>>>>>>> Though that's not a blocker for us at the moment, cause > >>> triggerers > >>>>> are > >>>>>>>>>>> mostly doing just polling using simple libs like requests. 
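The allowlist check described in the bullets above is essentially a classpath gate at defer time. This is a hypothetical sketch of that decision logic only; the function name, the return values, and the example trigger classpaths are illustrative, not part of Airflow or the platform described in the thread:

```python
# Hypothetical sketch of the defer-time allowlist: a trigger is handed to the
# centrally run triggerer only if its classpath was pre-approved (i.e. lives
# in the platform team's repo); otherwise the task is failed outright.
ALLOWED_TRIGGERS = {
    "airflow.triggers.temporal.DateTimeTrigger",
    "airflow.providers.http.triggers.http.HttpTrigger",
}


def on_task_defer(trigger_classpath: str) -> str:
    """Decide what happens when a customer's task defers."""
    if trigger_classpath in ALLOWED_TRIGGERS:
        return "poll"  # code exists centrally, async polling can proceed
    return "fail"      # trigger code isn't in the platform repo: fail the task


assert on_task_defer("airflow.triggers.temporal.DateTimeTrigger") == "poll"
assert on_task_defer("my_company.custom.SecretTrigger") == "fail"
```

Making the triggerer communicate over APIs only, as agreed later in the paragraph, would let this gate disappear: a customer-side triggerer instance could then run arbitrary trigger classes.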
> >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> On Fri, Jul 25, 2025 at 5:03 PM Igor Kholopov > >> >>>>>>>>>> <ikholo...@google.com.invalid > >> >>>>>>>>>>>> > >> >>>>>>>>>>> wrote: > >> >>>>>>>>>>> > >> >>>>>>>>>>>> Thanks Sumit for the detailed proposal. Overall I believe > it > >> >>>>> aligns > >> >>>>>>>>>> well > >> >>>>>>>>>>>> with the goals of making Airflow well-scalable beyond a > >> >>>>> single-team > >> >>>>>>>>>>>> deployment (and AIP-85 goals), so you have my full support > >> >>> with > >> >>>>> this > >> >>>>>>>>>> one. > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> I've left a couple of clarification requests on the AIP > >> >> page. > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Thanks, > >> >>>>>>>>>>>> Igor > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:50 AM Sumit Maheshwari < > >> >>>>>>>>>>> sumeet.ma...@gmail.com> > >> >>>>>>>>>>>> wrote: > >> >>>>>>>>>>>> > >> >>>>>>>>>>>>> Thanks Jarek and Ash, for the initial review. It's good to > >> >>> know > >> >>>>>>>>>> that > >> >>>>>>>>>>> the > >> >>>>>>>>>>>>> DAG processor has some preemptive measures in place to > >> >>> prevent > >> >>>>>>>>>> access > >> >>>>>>>>>>>>> to the DB. However, the main issue we are trying to solve > >> >> is > >> >>>> not > >> >>>>> to > >> >>>>>>>>>>>> provide > >> >>>>>>>>>>>>> DB creds to the customer teams, who are using Airflow as a > >> >>>>>>>>>> multi-tenant > >> >>>>>>>>>>>>> orchestration platform. I've updated the doc to reflect > >> >> this > >> >>>>> point > >> >>>>>>>>>> as > >> >>>>>>>>>>>> well. > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Answering Jarek's points, > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> #1. Yeah, had forgot to write about token mechanism, added > >> >>> that > >> >>>>> in > >> >>>>>>>>>> doc, > >> >>>>>>>>>>>> but > >> >>>>>>>>>>>>> still how the token can be obtained (safely) is still open > >> >> in > >> >>>> my > >> >>>>>>>>>> mind. 
> >>>>>>>>>>> I > >>>>>>>>>>>>> believe the token used by task executors can be created > >>> outside > >>>>> of > >>>>>>>>>> it > >>>>>>>>>>> as > >>>>>>>>>>>>> well (I may be wrong here). > >>>>>>>>>>>>> > >>>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers as > >>>> well, > >>>>>>>>>> but > >>>>>>>>>>>> that > >>>>>>>>>>>>> can be done as part of a different AIP > >>>>>>>>>>>>> > >>>>>>>>>>>>> #3. Yeah, I also believe the API should largely work. > >>>>>>>>>>>>> > >>>>>>>>>>>>> #4. Added in the AIP that instead of dag_dirs we can > >>> work > >>>>>>>>>> with > >>>>>>>>>>>>> dag_bundles, and every dag-processor instance would be > >> treated > >>>> as > >>>>> a > >>>>>>>>>> different > >>>>>>>>>>>>> bundle. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Also, added points around callbacks, as these are also > >>> fetched > >>>>>>>>>> directly > >>>>>>>>>>>>> from the DB. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:58 AM Jarek Potiuk < > >>>> ja...@potiuk.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> A clarification to this - the dag parser today is likely > >>> not > >>>>>>>>>>>> protection > >>>>>>>>>>>>>> against a dedicated malicious DAG author, but it does > >>> protect > >>>>>>>>>> against > >>>>>>>>>>>>>> casual DB access attempts - the db session is blanked out > >> in > >>>> the > >>>>>>>>>>>> parsing > >>>>>>>>>>>>>> process, as are the env var configs > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316 > >>>>>>>>>>>>>> - > >>>>>>>>>>>>>> is this perfect? no, but it’s much more than no protection > >>>>>>>>>>>>>> Oh absolutely.. 
This is exactly what we discussed back > >> >> then > >> >>> in > >> >>>>>>>>>> March > >> >>>>>>>>>>> I > >> >>>>>>>>>>>>>> think - and the way we decided to go for 3.0 with full > >> >>>> knowledge > >> >>>>>>>>>> it's > >> >>>>>>>>>>>> not > >> >>>>>>>>>>>>>> protecting against all threats. > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> On Fri, Jul 25, 2025 at 8:22 AM Ash Berlin-Taylor < > >> >>>>>>>>>> a...@apache.org> > >> >>>>>>>>>>>>> wrote: > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>> A clarification to this - the dag parser today is likely > >> >>> not > >> >>>>>>>>>>>> protection > >> >>>>>>>>>>>>>>> against a dedicated malicious DAG author, but it does > >> >>> protect > >> >>>>>>>>>>> against > >> >>>>>>>>>>>>>>> casual DB access attempts - the db session is blanked > out > >> >>> in > >> >>>>>>>>>> the > >> >>>>>>>>>>>>> parsing > >> >>>>>>>>>>>>>>> process , as are the env var configs > >> >>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>> > >> >>>>> > >> >>>> > >> >>> > >> >> > >> > https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316 > >> >>>>>>>>>>>>>>> - is this perfect no? but it’s much more than no > >> >> protection > >> >>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>> On 24 Jul 2025, at 21:56, Jarek Potiuk < > >> >> ja...@potiuk.com> > >> >>>>>>>>>> wrote: > >> >>>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>> Currently in the DagFile processor there is no > built-in > >> >>>>>>>>>>> protection > >> >>>>>>>>>>>>>>> against > >> >>>>>>>>>>>>>>>> user code from Dag Parsing to - for example - read > >> >>> database > >> >>>>>>>>>>>>>>>> credentials from airflow configuration and use them to > >> >>> talk > >> >>>>>>>>>> to DB > >> >>>>>>>>>>>>>>> directly. > >> >>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>> > >> >>>> > >> >>> > >> >> > >> > >> >