Re: [DISCUSS] AIP-92 Isolate DAG parsing logic

Jarek Potiuk Thu, 07 Aug 2025 05:40:52 -0700

Ah.. So if we are talking about a more complete approach - seeing those
comments from Ash makes me think that we should have another (connected)
AIP about splitting the distributions. We have never finalized it
(nor even discussed it) but Ash - you had some initial document for that.
So maybe we should finalize it and, rather than specify it in this AIP,
have a separate AIP about the distribution split that AIP-92 could depend on.
It seems much more reasonable to separate the "distribution and code split"
from parsing isolation, I think, and implement them separately/in parallel.


Reading Ash's comments (and maybe I am going a bit further than Ash) it calls
for something that I am a big proponent of - splitting "airflow-core" and
having different "scheduler", "webserver", "dag processor" and
"triggerer" distributions. Now that we have the capability of having "shared"
code, we do not need "common" code to make it happen - because we can
share code.

What it could give us - on top of a clean client/server split, we could have
different dependencies used by those distributions. Additionally, we could
also split off the executors from providers and finally implement it in a
way that the scheduler does not use providers at all (not even the
cncf.kubernetes or celery providers would be installed in the scheduler or
webserver, but "executors" packages instead). The code sharing approach with
symlinks we have now will make it a .... breeze :) . That would also imply
sharing "connection" definitions through the DB, and likely finally
implementing the "test connection" feature properly (i.e. executing test
connection in the worker / triggerer rather than in the web server, which is
the reason why we disabled it by default now). This way "api-server" would
not need any of the providers to be installed either, which IMHO is the
biggest win from a security point of view.

And the nice thing about it is that it would be rather transparent when
anyone uses "pip install apache-airflow" - it would behave exactly the
same, no more complexity involved, simply more distributions installed when
the "apache-airflow" meta-distribution is used. But it would allow those who
want to implement a more complex and secure setup to have different
"environments" with modularized pieces of airflow installed:

  - only "apache-airflow-dag-processor + task-sdk + providers" where
    dag-processor is run
  - only "apache-airflow-scheduler + executors" where scheduler is installed
  - only "apache-airflow-task-sdk + providers" where workers are running
  - only "apache-airflow-api-server" where api-server is running
  - only "apache-airflow-trigger + task-sdk + providers" where the triggerer
    is running
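
To make the environments listed above a bit more concrete, here is a rough
sketch in Python of how the split could be written down and checked. It is
only an illustration: the distribution names are copied from this message and
are assumptions (none of them is a published package), "task-sdk",
"providers" and "executors" stand for groups of distributions, and the two
helper functions are hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical layout: names are taken from the message above, not from any
# package index. "task-sdk", "providers" and "executors" are placeholders for
# groups of distributions rather than single packages.
ENVIRONMENTS: Dict[str, FrozenSet[str]] = {
    "dag-processor": frozenset(
        {"apache-airflow-dag-processor", "task-sdk", "providers"}
    ),
    "scheduler": frozenset({"apache-airflow-scheduler", "executors"}),
    "worker": frozenset({"apache-airflow-task-sdk", "providers"}),
    "api-server": frozenset({"apache-airflow-api-server"}),
    "triggerer": frozenset({"apache-airflow-trigger", "task-sdk", "providers"}),
}


def needs_providers(component: str) -> bool:
    """Return True if providers are installed alongside the component."""
    return "providers" in ENVIRONMENTS[component]


def meta_distribution() -> FrozenSet[str]:
    """Union of all environments, i.e. what the meta-distribution pulls in."""
    return frozenset().union(*ENVIRONMENTS.values())


if __name__ == "__main__":
    for name in sorted(ENVIRONMENTS):
        print(f"{name}: providers installed = {needs_providers(name)}")
```

The point of the sketch is the property it lets us check: scheduler and
api-server never see providers, while the union of all environments is still
what a plain "apache-airflow" install would give.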

I am happy (Ash, if you are fine with that) to take that original document
over and lead this part and the new AIP to completion (including
implementation). I am very much convinced that this will lead to much
better dependency security and more modular code without impacting the
"apache-airflow" installation complexity.

If we do it this way, the code/clean split part would be "delegated out"
from AIP-92 to this new AIP and turned into a dependency.

J.


On Thu, Aug 7, 2025 at 1:51 PM Ash Berlin-Taylor <[email protected]> wrote:

> This AIP is definitely heading in the right direction and is a feature I’d
> like to see.
>
> For me the outstanding things that need more detail:
>
> 1. The Authentication token. How will this long lived token work without
> being insecure? Who and what will generate it? How will we identify
> top-level requests for Variables in order to be able to add Variable
> RBAC/ACLs? This is an important enough thing that I think it needs
> discussion before we vote on this AIP.
> 2. Security generally — how will this work, especially with the
> multi-team? I think this likely means making the APIs work on the bundle
> level as you mention in the doc, but I haven’t thought deeply about this
> yet.
> 3. API Versioning? One of the key driving goals with AIP-72 and the
> Task Execution SDK was the idea that “you can upgrade the API server as you
> like, and your clients/workers never need to work” — i.e. the API server is
> 100% working with all older versions of the TaskSDK. I don’t know if we
> will achieve that goal in the long run but it is the desire, and part of
> why we are using CalVer and the Cadwyn library to provide API versioning.
> 4. As mentioned previously, not sure the existing serialised JSON format
> for DAGs is correct, but since that now has version and we already have the
> ability to upgrade that somewhere in the Airflow Core that doesn’t
> necessarily become a blocker/pre-requisite for this AIP.
>
> I think Dag parsing API client+submission+parsing process manager should
> either live in the Task SDK dist, or in a new separate dist that uses
> TaskSDK, but crucially not in apache-airflow-core. My reason for this is
> that I want it to be possible for the server components (scheduler, API
> server) to not need task-sdk installed (just for cleanliness/avoiding
> confusion about what versions it needs) and also vice versa, to be able to
> run a “team worker bundle” (Dag parsing, workers, triggered/async workers)
> on whatever version of TaskSDK they choose, again without
> apache-airflow-core installed for avoidance of doubt.
>
> Generally I would like this as it means we can have a nicer separation of
> Core and Dag parsing code, as the dag parsing itself uses the SDK, it would
> be nice to have a proper server/client split, both from a tighter security
> point-of-view, but also from a code layout point of view.
>
> -ash
>
>
> > On 7 Aug 2025, at 12:36, Jarek Potiuk <[email protected]> wrote:
> >
> > Well, you started it - so it's up to you to decide if you think we have
> > consensus, or whether we need a vote.
> >
> > And it's not a question of "informal" vote but it's rather clear following
> > the https://www.apache.org/foundation/voting.html that we either need a
> > LAZY CONSENSUS or VOTE thread. Both are formal.
> >
> > This is the difficult part when you have a proposal, to assess (by you)
> > whether we are converging to consensus or whether vote is needed. There is
> > no other body or "authority" to do it for you.
> >
> > J.
> >
> > On Thu, Aug 7, 2025 at 1:02 PM Sumit Maheshwari <[email protected]>
> > wrote:
> >
> >> Sorry for nudging again, but can we get into some consensus on this? I mean
> >> if this AIP isn't good enough, then we can drop it altogether and someone
> >> can rethink the whole thing. Should we do some kind of informal voting and
> >> close this thread?
> >>
> >> On Mon, Aug 4, 2025 at 3:32 PM Jarek Potiuk <[email protected]> wrote:
> >>
> >>>>> My main concern with this right now is the serialisation format of
> >> DAGs
> >>> —
> >>>> it wasn’t really designed with remote submission in mind, so it need
> >> some
> >>>> careful examination to see if it is fit for this purpose or not.
> >>>
> >>> I understand Ash's concerns - the format has not been designed with
> >>> size/speed optimization in mind so **possibly** we could design a different
> >>> format that would be better suited.
> >>>
> >>> BUT  ... Done is better than perfect.
> >>>
> >>> I think there are a number of risks involved in changing the format and it
> >>> could significantly increase time of development with uncertain gains at
> >>> the end - also because of the progress in compression that happened over
> >>> the last few years.
> >>>
> >>> It might be a good idea to experiment a bit with different compression
> >>> algorithms for "our" dag representation and possibly we could find the best
> >>> algorithm for "airflow dag" type of json data. There are a lot of
> >>> repetitions in the JSON representation and I guess in "our" json
> >>> representation there are some artifacts and repeated sections that simply
> >>> might compress well with different algorithms. Also in this case
> >>> speed matters (and CPU trade-off).
> >>>
> >>> Looking at compression "theory" - before we experiment with it - there is
> >>> the relatively new standard "zstandard" https://github.com/facebook/zstd
> >>> compression opensourced in 2016 which I've heard good things about -
> >>> especially that it maintains a very good compression rate for text data,
> >>> but also it is tremendously fast - especially for decompression (which is
> >>> super important factor for us - we compress new DAG representation far less
> >>> often than decompress it in general case). It is standardized in RFC
> >>> https://datatracker.ietf.org/doc/html/rfc8878 and there are various
> >>> implementations and it is even being added to Python standard library in
> >>> Python 3.14 https://docs.python.org/3.14/library/compression.zstd.html and
> >>> there is a very well maintained python binding library
> >>> https://pypi.org/project/zstd/ to Yann Collet (algorithm author) ZSTD C
> >>> library. And libzstd is already part of our images - it is needed by other
> >>> dependencies of ours. All with BSD licence, directly usable by us.
> >>>
> >>> I think this one might be a good candidate for us to try, and possibly with
> >>> zstd we could achieve both size and CPU overhead that would be comparable
> >>> with any "new" format we could come up with - especially that we are
> >>> talking merely about processing a huge blob between "storable" (compressed)
> >>> and "locally usable" state (Python dict). We could likely use a streaming
> >>> JSON library (say the one that is used in Pydantic internally
> >>> https://github.com/pydantic/jiter - we already have it as part of
> >>> Pydantic)
> >>> to also save memory - we could stream decompressed stream into jiter so
> >>> that both the json dict and string representation does not have to be
> >>> loaded fully in memory at the same time. There are likely lots of
> >>> optimisations we could do - I mentioned possibly streaming the data from
> >>> API directly to DB (if this is possible - not sure)
> >>>
> >>> J.
> >>>
> >>>
> >>> On Mon, Aug 4, 2025 at 9:10 AM Sumit Maheshwari <[email protected]>
> >>> wrote:
> >>>
> >>>>>
> >>>>> My main concern with this right now is the serialisation format of
> >>> DAGs —
> >>>>> it wasn’t really designed with remote submission in mind, so it need
> >>> some
> >>>>> careful examination to see if it is fit for this purpose or not.
> >>>>>
> >>>>
> >>>> I'm not sure on this point, cause if we are able to convert a DAG into
> >>>> JSON, then it has to be transferable over the internet.
> >>>>
> >>>> In particular One of the things I worry about is that the JSON can get
> >>> huge
> >>>>> — I’ve seem this as large as 10-20Mb for some dags
> >>>>
> >>>>
> >>>> Yeah, agree on this, thats why we can transfer compressed data instead
> >> of
> >>>> real json. Of course, this won't guarantee that the payload will
> always
> >>> be
> >>>> small enough, but we can't say that it'll definitely happen either.
> >>>>
> >>>> I also wonder if as part of this proposal we should move the Callback
> >>>>> requests off the dag parsers and on to the workers instead
> >>>>
> >>>> let's make such a "workfload" implementation stream that could support
> >>> both
> >>>>> - Deadlines and DAG parsing logic
> >>>>
> >>>>
> >>>> I don't have any strong opinion here, but it feels like it's gonna
> blow
> >>> up
> >>>> the scope of the AIP too much.
> >>>>
> >>>>
> >>>> On Fri, Aug 1, 2025 at 2:27 AM Jarek Potiuk <[email protected]> wrote:
> >>>>
> >>>>>> My main concern with this right now is the serialisation format of
> >>>> DAGs —
> >>>>> it wasn’t really designed with remote submission in mind, so it need
> >>> some
> >>>>> careful examination to see if it is fit for this purpose or not.
> >>>>>
> >>>>> Yep. That might be potentially a problem (or at least "need more
> >>>> resources
> >>>>> to run airflow") and that is where my "2x memory" came from if we do
> >> it
> >>>> in
> >>>>> a trivial way. Currently we a) keep the whole DAG in memory when
> >>>>> serializing it b) submit it to database (also using essentially some
> >>> kind
> >>>>> of API (implemented by the database client) - so we know the whole
> >>> thing
> >>>>> "might work" but indeed if you use a trivial implementation of
> >>> submitting
> >>>>> the whole json - it basically means that the whole json will have to
> >>> also
> >>>>> be kept in the memory of API server. But we also compress it when
> >>> needed
> >>>> -
> >>>>> I wonder what are the compression ratios we saw with those 10-20MBs
> >>> Dags
> >>>> -
> >>>>> if the problem is using strings where bool would suffice, compression
> >>>>> should generally help a lot. We could only ever send compressed data
> >>> over
> >>>>> the API - there seems to be no need to send "plain JSON" data over
> >> the
> >>>> API
> >>>>> or storing the plain JSON in the DB (of course that trades memory for
> >>>> CPU).
> >>>>>
> >>>>> I wonder if sqlalchemy 2 (and drivers for MySQL/Postgres) have
> >> support
> >>>> for
> >>>>> any kind if binary data streaming - because that could help a lot of
> >> if
> >>>> we
> >>>>> could use streaming HTTP API and chunk and append the binary chunks
> >>> (when
> >>>>> writing) - or read data in chunks ans stream them back via the API.
> >>> That
> >>>>> could seriously decrease the amount of memory needed by the API
> >> server
> >>> to
> >>>>> process such huge serialized dags.
> >>>>>
> >>>>> And yeah - I would also love the "execute task" to be implemented
> >> here
> >>> -
> >>>>> but I am not sure if this should be part of the same effort or maybe
> >> a
> >>>>> separate implementation? That sounds very loosely coupled with DB
> >>>>> isolation. And it seems a common theme - I think that would also make
> >>> the
> >>>>> sync Deadline alerts case that we discussed at the dev call today. I
> >>>> wonder
> >>>>> if that should not be kind of parallel (let's make such a "workfload"
> >>>>> implementation stream that could support both - Deadlines and DAG
> >>> parsing
> >>>>> logic. We have already two "users" for it and I really love the
> >> saying
> >>>> "if
> >>>>> you want to make something reusable - make it usable first"  - seems
> >>> like
> >>>>> we might have good opportunity to make such workload implementation
> >>>> "doubly
> >>>>> used"  from the beginning which would increase chances it will be
> >>>>> "reusable" for other things as well :).
> >>>>>
> >>>>> J.
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 31, 2025 at 12:28 PM Ash Berlin-Taylor <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>>> My main concern with this right now is the serialisation format of
> >>>> DAGs —
> >>>>>> it wasn’t really designed with remote submission in mind, so it
> >> need
> >>>> some
> >>>>>> careful examination to see if it is fit for this purpose or not.
> >>>>>>
> >>>>>> In particular One of the things I worry about is that the JSON can
> >>> get
> >>>>>> huge — I’ve seem this as large as 10-20Mb for some dags(!!) (which
> >> is
> >>>>>> likely due to things being included as text when a bool might
> >>> suffice,
> >>>>> for
> >>>>>> example) But I don’t think “just submit the existing JSON over an
> >>> API”
> >>>>> is a
> >>>>>> good idea.
> >>>>>>
> >>>>>> I also wonder if as part of this proposal we should move the
> >> Callback
> >>>>>> requests off the dag parsers and on to the workers instead — in
> >>> AIP-72
> >>>> we
> >>>>>> introduced the concept of a Workload, with the only one existing
> >>> right
> >>>>> now
> >>>>>> is “ExecuteTask”
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/airflow/blob/8e1201c7713d5c677fa6f6d48bbd4f6903505f61/airflow-core/src/airflow/executors/workloads.py#L87-L88
> >>>>>> — it might be time to finally move task and dag callbacks to the
> >> same
> >>>>> thing
> >>>>>> and make dag parsers only responsible for, well, parsing. :)
> >>>>>>
> >>>>>> These are all solvable problems, and this will be a great feature
> >> to
> >>>>> have,
> >>>>>> but we need to do some more thinking and planning first.
> >>>>>>
> >>>>>> -ash
> >>>>>>
> >>>>>>> On 31 Jul 2025, at 10:12, Sumit Maheshwari <
> >> [email protected]
> >>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Gentle reminder for everyone to review the proposal.
> >>>>>>>
> >>>>>>> Updated link:
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-92+Isolate+DAG+processor%2C+Callback+processor%2C+and+Triggerer+from+core+services
> >>>>>>>
> >>>>>>> On Tue, Jul 29, 2025 at 4:37 PM Sumit Maheshwari <
> >>>>> [email protected]
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks everyone for reviewing this AIP. As Jarek and others
> >>>>> suggested, I
> >>>>>>>> expanded the scope of this AIP and divided it into three phases.
> >>>> With
> >>>>>> the
> >>>>>>>> increased scope, the boundary line between this AIP and AIP-85
> >>> got a
> >>>>>> little
> >>>>>>>> thinner, but I believe these are still two different
> >> enhancements
> >>> to
> >>>>>> make.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jul 25, 2025 at 10:51 PM Sumit Maheshwari <
> >>>>>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Yeah, overall it makes sense to include Triggers as well to be
> >>> part
> >>>>> of
> >>>>>>>>> this AIP and phase out the implementation. Though I didn't
> >>> exclude
> >>>>>> Triggers
> >>>>>>>>> because "Uber" doesn't need that, I just thought of keeping the
> >>>> scope
> >>>>>> of
> >>>>>>>>> development small and achieving them, just like it was done in
> >>>>> Airlfow
> >>>>>> 3 by
> >>>>>>>>> secluding only workers and not DAG-processor & Triggers.
> >>>>>>>>>
> >>>>>>>>> But if you think Triggers should be part of this AIP itself,
> >>> then I
> >>>>> can
> >>>>>>>>> do that and include Triggers as well in it.
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 25, 2025 at 7:34 PM Jarek Potiuk <[email protected]
> >>>
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I would very much prefer the architectural choices of this AIP
> >>> are
> >>>>>> based
> >>>>>>>>>> on
> >>>>>>>>>> "general public" needs rather than "Uber needs" even if Uber
> >>> will
> >>>> be
> >>>>>>>>>> implementing it - so from my point of view having Trigger
> >>>> separation
> >>>>>> as
> >>>>>>>>>> part of it is quite important.
> >>>>>>>>>>
> >>>>>>>>>> But that's not even this.
> >>>>>>>>>>
> >>>>>>>>>> We've been discussing for example for Deadlines (being
> >>> implemented
> >>>>> by
> >>>>>>>>>> Dennis and Ramit   a possibility of short, notification-style
> >>>>>> "deadlines"
> >>>>>>>>>> to be send to triggerer for execution - this is well advanced
> >>> now,
> >>>>> and
> >>>>>>>>>> whether you want it or not Dag-provided code might be
> >> serialized
> >>>> and
> >>>>>> sent
> >>>>>>>>>> to triggerer for execution. This is part of our "broader"
> >>>>>> architectural
> >>>>>>>>>> change where we treat "workers" and "triggerer" similarly as a
> >>>>> general
> >>>>>>>>>> executors of "sync" and "async" tasks respectively. That's
> >> where
> >>>>>> Airflow
> >>>>>>>>>> is
> >>>>>>>>>> evolving towards - inevitably.
> >>>>>>>>>>
> >>>>>>>>>> But we can of course phase things in out for implementation -
> >>> even
> >>>>> if
> >>>>>> AIP
> >>>>>>>>>> should cover both, I think if the goal of the AIP and preamble
> >>> is
> >>>>>> about
> >>>>>>>>>> separating "user code" from "database" as the main reason, it
> >>> also
> >>>>>> means
> >>>>>>>>>> Triggerer if you ask me (from design point of view at least).
> >>>>>>>>>>
> >>>>>>>>>> Again implementation can be phased and even different people
> >> and
> >>>>> teams
> >>>>>>>>>> might work on those phases/pieces.
> >>>>>>>>>>
> >>>>>>>>>> J.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 25, 2025 at 2:29 PM Sumit Maheshwari <
> >>>>>> [email protected]
> >>>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers as
> >>>> well,
> >>>>>>>>>> but
> >>>>>>>>>>>> that
> >>>>>>>>>>>> can be done as part of a different AIP
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> You won't achieve your goal of "true" isolation of user code
> >> if
> >>>> you
> >>>>>>>>>> don't
> >>>>>>>>>>>> do triggerer. I think if the goal is to achieve it - it
> >> should
> >>>>> cover
> >>>>>>>>>>> both.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> My bad, should've explained our architecture for triggers as
> >>>> well,
> >>>>>>>>>>> apologies. So here it is:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>  - Triggers would be running on a centralized service, so
> >> all
> >>>> the
> >>>>>>>>>> Trigger
> >>>>>>>>>>>  classes will be part of the platform team's repo and not
> >> the
> >>>>>>>>>> customer's
> >>>>>>>>>>> repo
> >>>>>>>>>>>  - The triggers won't be able to use any libs other than std
> >>>> ones,
> >>>>>>>>>> which
> >>>>>>>>>>>  are being used in core Airflow (like requests, etc)
> >>>>>>>>>>>  - As we are the owners of the core Airflow repo, customers
> >>> have
> >>>>> to
> >>>>>>>>>> get
> >>>>>>>>>>>  our approval to land any class in this path (unlike the
> >> dags
> >>>> repo
> >>>>>>>>>> which
> >>>>>>>>>>>  they own)
> >>>>>>>>>>>  - When a customer's task defer, we would have an allowlist
> >> on
> >>>> our
> >>>>>>>>>> side
> >>>>>>>>>>>  to check if we should do the async polling or not
> >>>>>>>>>>>  - If the Trigger class isn't part of our repo (allowlist),
> >>> just
> >>>>>>>>>> fail the
> >>>>>>>>>>>  task, as anyway we won't be having the code that they used
> >> in
> >>>> the
> >>>>>>>>>>> trigger
> >>>>>>>>>>>  class
> >>>>>>>>>>>  - If any of these conditions aren't suitable for you (as a
> >>>>>>>>>> customer),
> >>>>>>>>>>>  feel free to use sync tasks only
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> But in general, I agree to make triggerer svc also
> >> communicate
> >>>> over
> >>>>>>>>>> apis
> >>>>>>>>>>> only. If that is done, then we can have instances of
> >> triggerer
> >>>> svc
> >>>>>>>>>> running
> >>>>>>>>>>> at customer's side as well, which can process any type of
> >>> trigger
> >>>>>>>>>> class.
> >>>>>>>>>>> Though that's not a blocker for us at the moment, cause
> >>> triggerer
> >>>>> are
> >>>>>>>>>>> mostly doing just polling using simple libs like requests.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Jul 25, 2025 at 5:03 PM Igor Kholopov
> >>>>>>>>>> <[email protected]
> >>>>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks Sumit for the detailed proposal. Overall I believe it
> >>>>> aligns
> >>>>>>>>>> well
> >>>>>>>>>>>> with the goals of making Airflow well-scalable beyond a
> >>>>> single-team
> >>>>>>>>>>>> deployment (and AIP-85 goals), so you have my full support
> >>> with
> >>>>> this
> >>>>>>>>>> one.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've left a couple of clarification requests on the AIP
> >> page.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Igor
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:50 AM Sumit Maheshwari <
> >>>>>>>>>>> [email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks Jarek and Ash, for the initial review. It's good to
> >>> know
> >>>>>>>>>> that
> >>>>>>>>>>> the
> >>>>>>>>>>>>> DAG processor has some preemptive measures in place to
> >>> prevent
> >>>>>>>>>> access
> >>>>>>>>>>>>> to the DB. However, the main issue we are trying to solve
> >> is
> >>>> not
> >>>>> to
> >>>>>>>>>>>> provide
> >>>>>>>>>>>>> DB creds to the customer teams, who are using Airflow as a
> >>>>>>>>>> multi-tenant
> >>>>>>>>>>>>> orchestration platform. I've updated the doc to reflect
> >> this
> >>>>> point
> >>>>>>>>>> as
> >>>>>>>>>>>> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Answering Jarek's points,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> #1. Yeah, had forgot to write about token mechanism, added
> >>> that
> >>>>> in
> >>>>>>>>>> doc,
> >>>>>>>>>>>> but
> >>>>>>>>>>>>> still how the token can be obtained (safely) is still open
> >> in
> >>>> my
> >>>>>>>>>> mind.
> >>>>>>>>>>> I
> >>>>>>>>>>>>> believe the token used by task executors can be created
> >>> outside
> >>>>> of
> >>>>>>>>>> it
> >>>>>>>>>>> as
> >>>>>>>>>>>>> well (I may be wrong here).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers as
> >>>> well,
> >>>>>>>>>> but
> >>>>>>>>>>>> that
> >>>>>>>>>>>>> can be done as part of a different AIP
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> #3. Yeah, I also believe the API should work largely.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> #4. Added that in the AIP, that instead of dag_dirs we can
> >>> work
> >>>>>>>>>> with
> >>>>>>>>>>>>> dag_bundles and every dag-processor instance would be
> >> treated
> >>>> as
> >>>>> a
> >>>>>>>>>> diff
> >>>>>>>>>>>>> bundle.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Also, added points around callbacks, as these are also
> >>> fetched
> >>>>>>>>>> directly
> >>>>>>>>>>>>> from the DB.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:58 AM Jarek Potiuk <
> >>>> [email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A clarification to this - the dag parser today is likely
> >>> not
> >>>>>>>>>>>> protection
> >>>>>>>>>>>>>> against a dedicated malicious DAG author, but it does
> >>> protect
> >>>>>>>>>> against
> >>>>>>>>>>>>>> casual DB access attempts - the db session is blanked out
> >> in
> >>>> the
> >>>>>>>>>>>> parsing
> >>>>>>>>>>>>>> process , as are the env var configs
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> is this perfect no? but it’s much more than no protection
> >>>>>>>>>>>>>> Oh absolutely.. This is exactly what we discussed back
> >> then
> >>> in
> >>>>>>>>>> March
> >>>>>>>>>>> I
> >>>>>>>>>>>>>> think - and the way we decided to go for 3.0 with full
> >>>> knowledge
> >>>>>>>>>> it's
> >>>>>>>>>>>> not
> >>>>>>>>>>>>>> protecting against all threats.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jul 25, 2025 at 8:22 AM Ash Berlin-Taylor <
> >>>>>>>>>> [email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A clarification to this - the dag parser today is likely
> >>> not
> >>>>>>>>>>>> protection
> >>>>>>>>>>>>>>> against a dedicated malicious DAG author, but it does
> >>> protect
> >>>>>>>>>>> against
> >>>>>>>>>>>>>>> casual DB access attempts - the db session is blanked out
> >>> in
> >>>>>>>>>> the
> >>>>>>>>>>>>> parsing
> >>>>>>>>>>>>>>> process , as are the env var configs
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
> >>>>>>>>>>>>>>> - is this perfect no? but it’s much more than no
> >> protection
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 24 Jul 2025, at 21:56, Jarek Potiuk <
> >> [email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Currently in the DagFile processor there is no  built-in
> >>>>>>>>>>> protection
> >>>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>>> user code from Dag Parsing to - for example - read
> >>> database
> >>>>>>>>>>>>>>>> credentials from airflow configuration and use them to
> >>> talk
> >>>>>>>>>> to DB
> >>>>>>>>>>>>>>> directly.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
