Also

> 1. The Authentication token. How will this long-lived token work without
being insecure? Who and what will generate it? How will we identify
top-level requests for Variables in order to be able to add Variable
RBAC/ACLs? This is an important enough thing that I think it needs
discussion before we vote on this AIP.

We are currently discussing - in the security team - an approach for JWT
token handling, so we could likely move the discussion there. It does have
some security implications, and I think we should bring our findings to the
devlist when we complete it, but we should add this case there as well.
IMHO we should have a different approach for the UI, a different one for
Tasks, a different one for the Triggerer, and a different one for the
DagProcessor (possibly the Triggerer and DagProcessor could share one,
because they essentially use the same kind of long-lived token). Ash - I
will add this to the discussion there.
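
To make the discussion concrete: a rough, untested sketch of what
audience-scoped (per-component) tokens could look like, assuming PyJWT -
every claim, audience, and helper name below is illustrative, not a
decision:

import datetime

import jwt  # PyJWT

SECRET_KEY = "change-me"  # in practice a managed, rotated signing key

def mint_component_token(component: str, bundle: str) -> str:
    # component is e.g. "ui", "task", "triggerer" or "dag-processor"; each
    # gets its own audience so the api-server can scope what it may read
    # (which is where Variable RBAC/ACLs could hook in).
    claims = {
        "aud": f"airflow-{component}",
        "sub": bundle,  # hypothetical: identify the bundle for per-bundle ACLs
        "exp": datetime.datetime.now(datetime.timezone.utc)
        + datetime.timedelta(hours=12),
    }
    return jwt.encode(claims, SECRET_KEY, algorithm="HS256")

def verify_component_token(token: str, expected_component: str) -> dict:
    # Rejects tokens minted for a different component (wrong audience).
    return jwt.decode(
        token,
        SECRET_KEY,
        algorithms=["HS256"],
        audience=f"airflow-{expected_component}",
    )

The interesting part is only the audience split - how the signing key is
distributed and rotated is exactly what the security-team discussion needs
to settle.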

J.



On Thu, Aug 7, 2025 at 2:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Ah.. So if we are talking about a more complete approach - seeing those
> comments from Ash - it makes me think we should have another (connected)
> AIP about splitting the distributions. We have never finalized it (nor
> even discussed it), but Ash - you had some initial document for that. So
> maybe we should finalize it and, rather than specify it in this AIP, have
> a separate AIP about the distribution split that AIP-92 could depend on.
> It seems much more reasonable to separate the "distribution and code
> split" from parsing isolation, I think, and implement them
> separately/in parallel.
>
> Reading Ash's comments (and maybe I am going a bit further than Ash), it
> calls for something that I am a big proponent of - splitting
> "airflow-core" and having separate "scheduler", "webserver",
> "dag processor" and "triggerer" distributions. Now we have the capability
> of sharing code between distributions - we do not need a "common"
> distribution to make it happen.
>
> What it could give us - on top of a clean client/server split - is that
> those distributions could use different dependencies. Additionally, we
> could also split off the executors from the providers and finally
> implement things so that the scheduler does not use providers at all (no
> cncf.kubernetes or celery providers installed in the scheduler or
> webserver, but "executors" packages instead). The code-sharing approach
> with symlinks we have now will make it a .... breeze :) . That would also
> imply sharing "connection" definitions through the DB, and likely finally
> implementing the "test connection" feature properly (i.e. executing the
> test connection in the worker / triggerer rather than in the web server,
> which is the reason why we disabled it by default). This way the
> "api-server" would not need any of the providers to be installed either,
> which IMHO is the biggest win from a security point of view.
>
> And the nice thing about it is that it would be rather transparent: when
> anyone uses "pip install apache-airflow" it would behave exactly the
> same, with no more complexity involved - simply more distributions
> installed when the "apache-airflow" meta-distribution is used. But it
> would allow those who want to implement a more complex and secure setup
> to have different "environments" with modularized pieces of Airflow
> installed - only "apache-airflow-dag-processor + task-sdk + providers"
> where the dag-processor is run, only "apache-airflow-scheduler +
> executors" where the scheduler is installed, only
> "apache-airflow-task-sdk + providers" where workers are running, only
> "apache-airflow-api-server" where the api-server is running, and only
> "apache-airflow-triggerer + task-sdk + providers" where the triggerer is
> running.
>
> I am happy (Ash, if you are fine with that) to take that original
> document over and lead this part and the new AIP to completion (including
> implementation). I am very much convinced that this will lead to much
> better dependency security and more modular code without impacting the
> "apache-airflow" installation complexity.
>
> If we do it this way, the code/clean-split part would be "delegated out"
> from AIP-92 to this new AIP and turned into a dependency.
>
> J.
>
>
> On Thu, Aug 7, 2025 at 1:51 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> This AIP is definitely heading in the right direction and is a feature
>> I’d like to see.
>>
>> For me the outstanding things that need more detail:
>>
>> 1. The Authentication token. How will this long-lived token work without
>> being insecure? Who and what will generate it? How will we identify
>> top-level requests for Variables in order to be able to add Variable
>> RBAC/ACLs? This is an important enough thing that I think it needs
>> discussion before we vote on this AIP.
>> 2. Security generally — how will this work, especially with multi-team?
>> I think this likely means making the APIs work at the bundle level as
>> you mention in the doc, but I haven’t thought deeply about this yet.
>> 3. API Versioning? One of the key driving goals with AIP-72 and the
>> Task Execution SDK was the idea that “you can upgrade the API server as
>> you like, and your clients/workers never need to change” — i.e. the API
>> server is 100% compatible with all older versions of the TaskSDK. I
>> don’t know if we will achieve that goal in the long run, but it is the
>> desire, and part of why we are using CalVer and the Cadwyn library to
>> provide API versioning.
>> 4. As mentioned previously, I’m not sure the existing serialised JSON
>> format for DAGs is correct, but since that format now has a version and
>> we already have the ability to upgrade it somewhere in Airflow Core,
>> it doesn’t necessarily become a blocker/pre-requisite for this AIP.
>>
>> I think the Dag parsing API client + submission + parsing process
>> manager should either live in the Task SDK dist, or in a new separate
>> dist that uses the TaskSDK, but crucially not in apache-airflow-core. My
>> reason for this is that I want it to be possible for the server
>> components (scheduler, API server) to not need task-sdk installed (just
>> for cleanliness/avoiding confusion about what versions it needs) and
>> also, vice versa, to be able to run a “team worker bundle” (Dag parsing,
>> workers, triggerer/async workers) on whatever version of the TaskSDK
>> they choose, again without apache-airflow-core installed, for avoidance
>> of doubt.
>>
>> Generally I would like this, as it means we can have a nicer separation
>> of Core and Dag parsing code. Since the dag parsing itself uses the SDK,
>> it would be nice to have a proper server/client split, both from a
>> tighter security point-of-view and from a code layout point of view.
>>
>> -ash
>>
>>
>> > On 7 Aug 2025, at 12:36, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Well, you started it - so it's up to you to decide if you think we have
>> > consensus, or whether we need a vote.
>> >
>> > And it's not a question of an "informal" vote - rather, it's clear,
>> > following https://www.apache.org/foundation/voting.html, that we
>> > either need a LAZY CONSENSUS or a VOTE thread. Both are formal.
>> >
>> > This is the difficult part when you have a proposal: assessing (by
>> > you) whether we are converging to consensus or whether a vote is
>> > needed. There is no other body or "authority" to do it for you.
>> >
>> > J.
>> >
>> > On Thu, Aug 7, 2025 at 1:02 PM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> > wrote:
>> >
>> >> Sorry for nudging again, but can we come to some consensus on this? I
>> >> mean, if this AIP isn't good enough, then we can drop it altogether
>> >> and someone can rethink the whole thing. Should we do some kind of
>> >> informal voting and close this thread?
>> >>
>> >> On Mon, Aug 4, 2025 at 3:32 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>
>> >>>> My main concern with this right now is the serialisation format of
>> >>>> DAGs — it wasn’t really designed with remote submission in mind, so
>> >>>> it needs some careful examination to see if it is fit for this
>> >>>> purpose or not.
>> >>>
>> >>> I understand Ash's concerns - the format has not been designed with
>> >>> size/speed optimization in mind, so **possibly** we could design a
>> >>> different format that would be better suited.
>> >>>
>> >>> BUT  ... Done is better than perfect.
>> >>>
>> >>> I think there are a number of risks involved in changing the format,
>> >>> and it could significantly increase development time with uncertain
>> >>> gains at the end - also because of the progress in compression that
>> >>> has happened over the last few years.
>> >>>
>> >>> It might be a good idea to experiment a bit with different
>> >>> compression algorithms for "our" dag representation, and possibly we
>> >>> could find the best algorithm for the "airflow dag" type of json
>> >>> data. There are a lot of repetitions in the JSON representation, and
>> >>> I guess in "our" json representation there are some artifacts and
>> >>> repeated sections that simply might compress well with different
>> >>> algorithms. Also, in this case speed matters (and the CPU trade-off).
>> >>>
>> >>> Looking at compression "theory" - before we experiment with it -
>> >>> there is the relatively new "zstandard" compression standard
>> >>> https://github.com/facebook/zstd - open-sourced in 2016 - which I've
>> >>> heard good things about: it maintains a very good compression rate
>> >>> for text data, but it is also tremendously fast - especially for
>> >>> decompression (which is a super important factor for us - we compress
>> >>> a new DAG representation far less often than we decompress it in the
>> >>> general case). It is standardized in RFC
>> >>> https://datatracker.ietf.org/doc/html/rfc8878, there are various
>> >>> implementations, and it is even being added to the Python standard
>> >>> library in Python 3.14
>> >>> https://docs.python.org/3.14/library/compression.zstd.html, and there
>> >>> is a very well maintained python binding library
>> >>> https://pypi.org/project/zstd/ to Yann Collet's (the algorithm
>> >>> author's) ZSTD C library. And libzstd is already part of our images -
>> >>> it is needed by other dependencies of ours. All with a BSD licence,
>> >>> directly usable by us.
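
For illustration, measuring what zstd buys us on a real serialized dag
could be as small as this - a minimal sketch assuming the `zstandard`
binding (the `zstd` binding linked above exposes similar
compress/decompress calls):

import json

import zstandard

def zstd_ratio(serialized_dag: dict, level: int = 3) -> float:
    # Compress the serialized dag blob and report the compression ratio.
    raw = json.dumps(serialized_dag).encode("utf-8")
    compressed = zstandard.ZstdCompressor(level=level).compress(raw)
    # Sanity-check the round-trip while we are at it.
    assert json.loads(zstandard.ZstdDecompressor().decompress(compressed)) == serialized_dag
    return len(raw) / len(compressed)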
>> >>>
>> >>> I think this one might be a good candidate for us to try, and
>> >>> possibly with zstd we could achieve both a size and a CPU overhead
>> >>> comparable with any "new" format we could come up with - especially
>> >>> as we are talking merely about converting a huge blob between a
>> >>> "storable" (compressed) state and a "locally usable" one (a Python
>> >>> dict). We could likely use a streaming JSON library (say the one used
>> >>> internally in Pydantic, https://github.com/pydantic/jiter - we
>> >>> already have it as part of Pydantic) to also save memory - we could
>> >>> stream the decompressed data into jiter so that the json dict and the
>> >>> string representation do not both have to be loaded fully in memory
>> >>> at the same time. There are likely lots of optimisations we could do
>> >>> - I mentioned possibly streaming the data from the API directly to
>> >>> the DB (if this is possible - not sure).
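
And a sketch of the reading side (same assumption about the binding;
whether jiter can consume such a stream incrementally is exactly the thing
to verify):

import json

import zstandard

def load_serialized_dag(compressed_fileobj) -> dict:
    # stream_reader decompresses lazily, but json.load still materializes
    # the whole decompressed payload - the real memory win would need an
    # incremental parser (jiter is the candidate mentioned above).
    reader = zstandard.ZstdDecompressor().stream_reader(compressed_fileobj)
    return json.load(reader)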
>> >>>
>> >>> J.
>> >>>
>> >>>
>> >>> On Mon, Aug 4, 2025 at 9:10 AM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>>>
>> >>>>> My main concern with this right now is the serialisation format of
>> >>>>> DAGs — it wasn’t really designed with remote submission in mind, so
>> >>>>> it needs some careful examination to see if it is fit for this
>> >>>>> purpose or not.
>> >>>>>
>> >>>>
>> >>>> I'm not sure on this point, cause if we are able to convert a DAG
>> into
>> >>>> JSON, then it has to be transferable over the internet.
>> >>>>
>> >>>>> In particular, one of the things I worry about is that the JSON can
>> >>>>> get huge — I’ve seen this as large as 10-20Mb for some dags
>> >>>>
>> >>>>
>> >>>> Yeah, agree on this, that's why we can transfer compressed data
>> >>>> instead of the real json. Of course, this won't guarantee that the
>> >>>> payload will always be small enough, but we can't say that it'll
>> >>>> definitely happen either.
>> >>>>
>> >>>>> I also wonder if as part of this proposal we should move the
>> >>>>> Callback requests off the dag parsers and on to the workers instead
>> >>>>
>> >>>>> let's make such a "workload" implementation stream that could
>> >>>>> support both - Deadlines and DAG parsing logic
>> >>>>
>> >>>>
>> >>>> I don't have any strong opinion here, but it feels like it's gonna
>> >>>> blow up the scope of the AIP too much.
>> >>>>
>> >>>>
>> >>>> On Fri, Aug 1, 2025 at 2:27 AM Jarek Potiuk <ja...@potiuk.com>
>> >>>> wrote:
>> >>>>
>> >>>>>> My main concern with this right now is the serialisation format of
>> >>>>>> DAGs — it wasn’t really designed with remote submission in mind, so
>> >>>>>> it needs some careful examination to see if it is fit for this
>> >>>>>> purpose or not.
>> >>>>>
>> >>>>> Yep. That might potentially be a problem (or at least "need more
>> >>>>> resources to run airflow") and that is where my "2x memory" came
>> >>>>> from if we do it in a trivial way. Currently we a) keep the whole
>> >>>>> DAG in memory when serializing it, b) submit it to the database
>> >>>>> (also using essentially some kind of API, implemented by the
>> >>>>> database client) - so we know the whole thing "might work", but
>> >>>>> indeed with a trivial implementation of submitting the whole json,
>> >>>>> the whole json will also have to be kept in the memory of the API
>> >>>>> server. But we also compress it when needed - I wonder what
>> >>>>> compression ratios we saw with those 10-20MB Dags - if the problem
>> >>>>> is using strings where a bool would suffice, compression should
>> >>>>> generally help a lot. We could only ever send compressed data over
>> >>>>> the API - there seems to be no need to send "plain JSON" data over
>> >>>>> the API or to store the plain JSON in the DB (of course that trades
>> >>>>> memory for CPU).
>> >>>>>
>> >>>>> I wonder if sqlalchemy 2 (and the drivers for MySQL/Postgres) has
>> >>>>> support for any kind of binary data streaming - because it could
>> >>>>> help a lot if we could use a streaming HTTP API and chunk and
>> >>>>> append the binary chunks (when writing) - or read data in chunks
>> >>>>> and stream them back via the API. That could seriously decrease
>> >>>>> the amount of memory needed by the API server to process such huge
>> >>>>> serialized dags.
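
Something like this on the api-server side is what I have in mind - a very
rough sketch assuming FastAPI/Starlette (which the api-server uses); the
endpoint path and the DB helper are hypothetical, and whether the drivers
support appending chunks like this is exactly the open question:

from fastapi import FastAPI, Request

app = FastAPI()

async def append_dag_blob_chunk(bundle_name: str, chunk: bytes) -> None:
    ...  # hypothetical: append to a BLOB column, if the driver allows it

@app.post("/internal/dag-bundles/{bundle_name}/serialized-dags")
async def upload_serialized_dag(bundle_name: str, request: Request) -> dict:
    # Stream the compressed blob through in chunks instead of buffering
    # the whole payload in the API server's memory.
    async for chunk in request.stream():
        await append_dag_blob_chunk(bundle_name, chunk)
    return {"status": "stored"}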
>> >>>>>
>> >>>>> And yeah - I would also love the "execute task" part to be
>> >>>>> implemented here - but I am not sure if this should be part of the
>> >>>>> same effort or maybe a separate implementation? That sounds very
>> >>>>> loosely coupled with DB isolation. And it seems a common theme - I
>> >>>>> think that would also serve the sync Deadline alerts case that we
>> >>>>> discussed at the dev call today. I wonder if that should not be a
>> >>>>> kind of parallel effort (let's make such a "workload"
>> >>>>> implementation stream that could support both - Deadlines and DAG
>> >>>>> parsing logic). We already have two "users" for it, and I really
>> >>>>> love the saying "if you want to make something reusable - make it
>> >>>>> usable first" - it seems like we might have a good opportunity to
>> >>>>> make such a workload implementation "doubly used" from the
>> >>>>> beginning, which would increase the chances it will be "reusable"
>> >>>>> for other things as well :).
>> >>>>>
>> >>>>> J.
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Jul 31, 2025 at 12:28 PM Ash Berlin-Taylor <a...@apache.org>
>> >>>> wrote:
>> >>>>>
>> >>>>>> My main concern with this right now is the serialisation format of
>> >>>>>> DAGs — it wasn’t really designed with remote submission in mind, so
>> >>>>>> it needs some careful examination to see if it is fit for this
>> >>>>>> purpose or not.
>> >>>>>>
>> >>>>>> In particular, one of the things I worry about is that the JSON can
>> >>>>>> get huge — I’ve seen this as large as 10-20Mb for some dags(!!)
>> >>>>>> (which is likely due to things being included as text when a bool
>> >>>>>> might suffice, for example). But I don’t think “just submit the
>> >>>>>> existing JSON over an API” is a good idea.
>> >>>>>>
>> >>>>>> I also wonder if as part of this proposal we should move the
>> >>>>>> Callback requests off the dag parsers and on to the workers instead
>> >>>>>> — in AIP-72 we introduced the concept of a Workload, the only one
>> >>>>>> existing right now being “ExecuteTask”
>> >>>>>> https://github.com/apache/airflow/blob/8e1201c7713d5c677fa6f6d48bbd4f6903505f61/airflow-core/src/airflow/executors/workloads.py#L87-L88
>> >>>>>> — it might be time to finally move task and dag callbacks to the
>> >>>>>> same thing and make dag parsers only responsible for, well,
>> >>>>>> parsing. :)
>> >>>>>>
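
Purely to illustrate the direction (this is not in the code today): a
second Workload kind next to ExecuteTask might look roughly like the
sketch below - the field names are placeholders, not the real
workloads.py schema.

from pydantic import BaseModel

class RunCallback(BaseModel):
    """Hypothetical workload: run a dag/task callback on a worker."""

    kind: str = "RunCallback"
    bundle_name: str
    callback_fileloc: str  # file defining the callback, relative to the bundle
    token: str  # short-lived API token (ExecuteTask carries one too, IIRC)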
>> >>>>>> These are all solvable problems, and this will be a great feature
>> >>>>>> to have, but we need to do some more thinking and planning first.
>> >>>>>>
>> >>>>>> -ash
>> >>>>>>
>> >>>>>>> On 31 Jul 2025, at 10:12, Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>> Gentle reminder for everyone to review the proposal.
>> >>>>>>>
>> >>>>>>> Updated link:
>> >>>>>>>
>> >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-92+Isolate+DAG+processor%2C+Callback+processor%2C+and+Triggerer+from+core+services
>> >>>>>>>
>> >>>>>>> On Tue, Jul 29, 2025 at 4:37 PM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Thanks everyone for reviewing this AIP. As Jarek and others
>> >>>>>>>> suggested, I expanded the scope of this AIP and divided it into
>> >>>>>>>> three phases. With the increased scope, the boundary line between
>> >>>>>>>> this AIP and AIP-85 got a little thinner, but I believe these are
>> >>>>>>>> still two different enhancements to make.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Fri, Jul 25, 2025 at 10:51 PM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Yeah, overall it makes sense to include Triggers as well as part
>> >>>>>>>>> of this AIP and phase the implementation. Though I didn't
>> >>>>>>>>> exclude Triggers because "Uber" doesn't need that - I just
>> >>>>>>>>> thought of keeping the scope of development small, just like it
>> >>>>>>>>> was done in Airflow 3 by isolating only workers and not the
>> >>>>>>>>> DAG-processor & Triggers.
>> >>>>>>>>>
>> >>>>>>>>> But if you think Triggers should be part of this AIP itself,
>> >>>>>>>>> then I can do that and include Triggers in it as well.
>> >>>>>>>>>
>> >>>>>>>>> On Fri, Jul 25, 2025 at 7:34 PM Jarek Potiuk <ja...@potiuk.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> I would very much prefer the architectural choices of this AIP
>> >>>>>>>>>> to be based on "general public" needs rather than "Uber needs",
>> >>>>>>>>>> even if Uber will be implementing it - so from my point of
>> >>>>>>>>>> view, having Trigger separation as part of it is quite
>> >>>>>>>>>> important.
>> >>>>>>>>>>
>> >>>>>>>>>> But that's not even the main point.
>> >>>>>>>>>>
>> >>>>>>>>>> We've been discussing, for example for Deadlines (being
>> >>>>>>>>>> implemented by Dennis and Ramit), a possibility of short,
>> >>>>>>>>>> notification-style "deadlines" to be sent to the triggerer for
>> >>>>>>>>>> execution - this is well advanced now, and whether you want it
>> >>>>>>>>>> or not, Dag-provided code might be serialized and sent to the
>> >>>>>>>>>> triggerer for execution. This is part of our "broader"
>> >>>>>>>>>> architectural change where we treat "workers" and "triggerer"
>> >>>>>>>>>> similarly, as general executors of "sync" and "async" tasks
>> >>>>>>>>>> respectively. That's where Airflow is evolving - inevitably.
>> >>>>>>>>>>
>> >>>>>>>>>> But we can of course phase the implementation - even if the
>> >>>>>>>>>> AIP should cover both. I think if the goal of the AIP and its
>> >>>>>>>>>> preamble is separating "user code" from the "database" as the
>> >>>>>>>>>> main reason, it also means the Triggerer, if you ask me (from a
>> >>>>>>>>>> design point of view at least).
>> >>>>>>>>>>
>> >>>>>>>>>> Again, the implementation can be phased, and even different
>> >>>>>>>>>> people and teams might work on those phases/pieces.
>> >>>>>>>>>>
>> >>>>>>>>>> J.
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, Jul 25, 2025 at 2:29 PM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers as
>> >>>>>>>>>>>> well, but that can be done as part of a different AIP
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>> You won't achieve your goal of "true" isolation of user code
>> >>>>>>>>>>>> if you don't do the triggerer. I think if the goal is to
>> >>>>>>>>>>>> achieve it - it should cover both.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> My bad, I should've explained our architecture for triggers
>> >>>>>>>>>>> as well, apologies. So here it is:
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>  - Triggers would be running on a centralized service, so all
>> >>>>>>>>>>>  the Trigger classes will be part of the platform team's repo
>> >>>>>>>>>>>  and not the customer's repo
>> >>>>>>>>>>>  - The triggers won't be able to use any libs other than the
>> >>>>>>>>>>>  std ones which are being used in core Airflow (like requests,
>> >>>>>>>>>>>  etc)
>> >>>>>>>>>>>  - As we are the owners of the core Airflow repo, customers
>> >>>>>>>>>>>  have to get our approval to land any class in this path
>> >>>>>>>>>>>  (unlike the dags repo, which they own)
>> >>>>>>>>>>>  - When a customer's task defers, we would check an allowlist
>> >>>>>>>>>>>  on our side to decide whether we should do the async polling
>> >>>>>>>>>>>  or not (see the sketch below)
>> >>>>>>>>>>>  - If the Trigger class isn't part of our repo (allowlist), we
>> >>>>>>>>>>>  just fail the task, as we won't have the code that they used
>> >>>>>>>>>>>  in the trigger class anyway
>> >>>>>>>>>>>  - If any of these conditions aren't suitable for you (as a
>> >>>>>>>>>>>  customer), feel free to use sync tasks only
>> >>>>>>>>>>>
>> >>>>>>>>>>>
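The allowlist gate from the list above could be as simple as the sketch
below (illustrative only - the class paths and the failure mechanics are
made up):

ALLOWED_TRIGGER_CLASSPATHS = {
    "airflow.providers.standard.triggers.temporal.DateTimeTrigger",
    "platform_repo.triggers.HttpPollTrigger",  # hypothetical platform-owned class
}

def gate_deferral(trigger_classpath: str) -> None:
    # Called when a customer task defers: fail fast if the trigger class
    # is not one the platform team ships, since we won't have its code.
    if trigger_classpath not in ALLOWED_TRIGGER_CLASSPATHS:
        raise RuntimeError(
            f"Trigger {trigger_classpath!r} is not allowlisted; failing the task."
        )
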
>> >>>>>>>>>>> But in general, I agree with making the triggerer svc also
>> >>>>>>>>>>> communicate over APIs only. If that is done, then we can have
>> >>>>>>>>>>> instances of the triggerer svc running on the customer's side
>> >>>>>>>>>>> as well, which can process any type of trigger class. Though
>> >>>>>>>>>>> that's not a blocker for us at the moment, cause triggerers
>> >>>>>>>>>>> are mostly just doing polling using simple libs like requests.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Fri, Jul 25, 2025 at 5:03 PM Igor Kholopov
>> >>>>>>>>>>> <ikholo...@google.com.invalid> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Thanks Sumit for the detailed proposal. Overall I believe it
>> >>>>>>>>>>>> aligns well with the goals of making Airflow scalable beyond
>> >>>>>>>>>>>> a single-team deployment (and the AIP-85 goals), so you have
>> >>>>>>>>>>>> my full support with this one.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I've left a couple of clarification requests on the AIP
>> >> page.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Igor
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:50 AM Sumit Maheshwari <sumeet.ma...@gmail.com>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks Jarek and Ash for the initial review. It's good to
>> >>>>>>>>>>>>> know that the DAG processor has some preemptive measures in
>> >>>>>>>>>>>>> place to prevent access to the DB. However, the main issue
>> >>>>>>>>>>>>> we are trying to solve is not to provide DB creds to the
>> >>>>>>>>>>>>> customer teams, who are using Airflow as a multi-tenant
>> >>>>>>>>>>>>> orchestration platform. I've updated the doc to reflect this
>> >>>>>>>>>>>>> point as well.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Answering Jarek's points,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> #1. Yeah, I had forgotten to write about the token
>> >>>>>>>>>>>>> mechanism; I added that to the doc, but how the token can be
>> >>>>>>>>>>>>> obtained (safely) is still open in my mind. I believe the
>> >>>>>>>>>>>>> token used by task executors can be created outside of it as
>> >>>>>>>>>>>>> well (I may be wrong here).
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> #2. Yeah, we would need something similar for triggerers as
>> >>>>>>>>>>>>> well, but that can be done as part of a different AIP
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> #3. Yeah, I also believe the API should largely work.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> #4. Added that to the AIP: instead of dag_dirs we can work
>> >>>>>>>>>>>>> with dag_bundles, and every dag-processor instance would be
>> >>>>>>>>>>>>> treated as a different bundle.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Also, added points around callbacks, as these are also
>> >>>>>>>>>>>>> fetched directly from the DB.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Fri, Jul 25, 2025 at 11:58 AM Jarek Potiuk <ja...@potiuk.com>
>> >>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> A clarification to this - the dag parser today is likely
>> >>>>>>>>>>>>>> not protection against a dedicated malicious DAG author,
>> >>>>>>>>>>>>>> but it does protect against casual DB access attempts - the
>> >>>>>>>>>>>>>> db session is blanked out in the parsing process, as are
>> >>>>>>>>>>>>>> the env var configs
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
>> >>>>>>>>>>>>>> - is this perfect? No. But it’s much more than no
>> >>>>>>>>>>>>>> protection.
>> >>>>>>>>>>>>>> Oh absolutely.. This is exactly what we discussed back
>> >>>>>>>>>>>>>> then, in March I think - and the way we decided to go for
>> >>>>>>>>>>>>>> 3.0, with full knowledge that it's not protecting against
>> >>>>>>>>>>>>>> all threats.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Fri, Jul 25, 2025 at 8:22 AM Ash Berlin-Taylor <a...@apache.org>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> A clarification to this - the dag parser today is likely
>> >>>>>>>>>>>>>>> not protection against a dedicated malicious DAG author,
>> >>>>>>>>>>>>>>> but it does protect against casual DB access attempts -
>> >>>>>>>>>>>>>>> the db session is blanked out in the parsing process, as
>> >>>>>>>>>>>>>>> are the env var configs
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
>> >>>>>>>>>>>>>>> - is this perfect? No. But it’s much more than no
>> >>>>>>>>>>>>>>> protection.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On 24 Jul 2025, at 21:56, Jarek Potiuk <ja...@potiuk.com>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Currently in the DagFile processor there is no built-in
>> >>>>>>>>>>>>>>>> protection preventing user code from Dag Parsing from -
>> >>>>>>>>>>>>>>>> for example - reading database credentials from the
>> >>>>>>>>>>>>>>>> airflow configuration and using them to talk to the DB
>> >>>>>>>>>>>>>>>> directly.
>> >>>>>>>>>>>>>>>