Re: [DISCUSS] AIP-92 Isolate DAG parsing logic

Sumit Maheshwari Mon, 04 Aug 2025 00:10:27 -0700

>
> My main concern with this right now is the serialisation format of DAGs —
> it wasn’t really designed with remote submission in mind, so it need some
> careful examination to see if it is fit for this purpose or not.
>


I'm not sure on this point, cause if we are able to convert a DAG into
JSON, then it has to be transferable over the internet.

In particular One of the things I worry about is that the JSON can get huge
> — I’ve seem this as large as 10-20Mb for some dags


Yeah, agree on this, thats why we can transfer compressed data instead of
real json. Of course, this won't guarantee that the payload will always be
small enough, but we can't say that it'll definitely happen either.

I also wonder if as part of this proposal we should move the Callback
> requests off the dag parsers and on to the workers instead

let's make such a "workfload" implementation stream that could support both
> - Deadlines and DAG parsing logic


I don't have any strong opinion here, but it feels like it's gonna blow up
the scope of the AIP too much.


On Fri, Aug 1, 2025 at 2:27 AM Jarek Potiuk <[email protected]> wrote:

> > My main concern with this right now is the serialisation format of DAGs —
> it wasn’t really designed with remote submission in mind, so it need some
> careful examination to see if it is fit for this purpose or not.
>
> Yep. That might be potentially a problem (or at least "need more resources
> to run airflow") and that is where my "2x memory" came from if we do it in
> a trivial way. Currently we a) keep the whole DAG in memory when
> serializing it b) submit it to database (also using essentially some kind
> of API (implemented by the database client) - so we know the whole thing
> "might work" but indeed if you use a trivial implementation of submitting
> the whole json - it basically means that the whole json will have to also
> be kept in the memory of API server. But we also compress it when needed -
> I wonder what are the compression ratios we saw with those 10-20MBs Dags -
> if the problem is using strings where bool would suffice, compression
> should generally help a lot. We could only ever send compressed data over
> the API - there seems to be no need to send "plain JSON" data over the API
> or storing the plain JSON in the DB (of course that trades memory for CPU).
>
> I wonder if sqlalchemy 2 (and drivers for MySQL/Postgres) have support for
> any kind if binary data streaming - because that could help a lot of if we
> could use streaming HTTP API and chunk and append the binary chunks (when
> writing) - or read data in chunks ans stream them back via the API. That
> could seriously decrease the amount of memory needed by the API server to
> process such huge serialized dags.
>
> And yeah - I would also love the "execute task" to be implemented here -
> but I am not sure if this should be part of the same effort or maybe a
> separate implementation? That sounds very loosely coupled with DB
> isolation. And it seems a common theme - I think that would also make the
> sync Deadline alerts case that we discussed at the dev call today. I wonder
> if that should not be kind of parallel (let's make such a "workfload"
> implementation stream that could support both - Deadlines and DAG parsing
> logic. We have already two "users" for it and I really love the saying "if
> you want to make something reusable - make it usable first"  - seems like
> we might have good opportunity to make such workload implementation "doubly
> used"  from the beginning which would increase chances it will be
> "reusable" for other things as well :).
>
> J.
>
>
> On Thu, Jul 31, 2025 at 12:28 PM Ash Berlin-Taylor <[email protected]> wrote:
>
> > My main concern with this right now is the serialisation format of DAGs —
> > it wasn’t really designed with remote submission in mind, so it need some
> > careful examination to see if it is fit for this purpose or not.
> >
> > In particular One of the things I worry about is that the JSON can get
> > huge — I’ve seem this as large as 10-20Mb for some dags(!!) (which is
> > likely due to things being included as text when a bool might suffice,
> for
> > example) But I don’t think “just submit the existing JSON over an API”
> is a
> > good idea.
> >
> > I also wonder if as part of this proposal we should move the Callback
> > requests off the dag parsers and on to the workers instead — in AIP-72 we
> > introduced the concept of a Workload, with the only one existing right
> now
> > is “ExecuteTask”
> >
> https://github.com/apache/airflow/blob/8e1201c7713d5c677fa6f6d48bbd4f6903505f61/airflow-core/src/airflow/executors/workloads.py#L87-L88
> > — it might be time to finally move task and dag callbacks to the same
> thing
> > and make dag parsers only responsible for, well, parsing. :)
> >
> > These are all solvable problems, and this will be a great feature to
> have,
> > but we need to do some more thinking and planning first.
> >
> > -ash
> >
> > > On 31 Jul 2025, at 10:12, Sumit Maheshwari <[email protected]>
> > wrote:
> > >
> > > Gentle reminder for everyone to review the proposal.
> > >
> > > Updated link:
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-92+Isolate+DAG+processor%2C+Callback+processor%2C+and+Triggerer+from+core+services
> > >
> > > On Tue, Jul 29, 2025 at 4:37 PM Sumit Maheshwari <
> [email protected]
> > >
> > > wrote:
> > >
> > >> Thanks everyone for reviewing this AIP. As Jarek and others
> suggested, I
> > >> expanded the scope of this AIP and divided it into three phases. With
> > the
> > >> increased scope, the boundary line between this AIP and AIP-85 got a
> > little
> > >> thinner, but I believe these are still two different enhancements to
> > make.
> > >>
> > >>
> > >>
> > >> On Fri, Jul 25, 2025 at 10:51 PM Sumit Maheshwari <
> > [email protected]>
> > >> wrote:
> > >>
> > >>> Yeah, overall it makes sense to include Triggers as well to be part
> of
> > >>> this AIP and phase out the implementation. Though I didn't exclude
> > Triggers
> > >>> because "Uber" doesn't need that, I just thought of keeping the scope
> > of
> > >>> development small and achieving them, just like it was done in
> Airlfow
> > 3 by
> > >>> secluding only workers and not DAG-processor & Triggers.
> > >>>
> > >>> But if you think Triggers should be part of this AIP itself, then I
> can
> > >>> do that and include Triggers as well in it.
> > >>>
> > >>> On Fri, Jul 25, 2025 at 7:34 PM Jarek Potiuk <[email protected]>
> wrote:
> > >>>
> > >>>> I would very much prefer the architectural choices of this AIP are
> > based
> > >>>> on
> > >>>> "general public" needs rather than "Uber needs" even if Uber will be
> > >>>> implementing it - so from my point of view having Trigger separation
> > as
> > >>>> part of it is quite important.
> > >>>>
> > >>>> But that's not even this.
> > >>>>
> > >>>> We've been discussing for example for Deadlines (being implemented
> by
> > >>>> Dennis and Ramit   a possibility of short, notification-style
> > "deadlines"
> > >>>> to be send to triggerer for execution - this is well advanced now,
> and
> > >>>> whether you want it or not Dag-provided code might be serialized and
> > sent
> > >>>> to triggerer for execution. This is part of our "broader"
> > architectural
> > >>>> change where we treat "workers" and "triggerer" similarly as a
> general
> > >>>> executors of "sync" and "async" tasks respectively. That's where
> > Airflow
> > >>>> is
> > >>>> evolving towards - inevitably.
> > >>>>
> > >>>> But we can of course phase things in out for implementation - even
> if
> > AIP
> > >>>> should cover both, I think if the goal of the AIP and preamble is
> > about
> > >>>> separating "user code" from "database" as the main reason, it also
> > means
> > >>>> Triggerer if you ask me (from design point of view at least).
> > >>>>
> > >>>> Again implementation can be phased and even different people and
> teams
> > >>>> might work on those phases/pieces.
> > >>>>
> > >>>> J.
> > >>>>
> > >>>> On Fri, Jul 25, 2025 at 2:29 PM Sumit Maheshwari <
> > [email protected]
> > >>>>>
> > >>>> wrote:
> > >>>>
> > >>>>>>
> > >>>>>>> #2. Yeah, we would need something similar for triggerers as well,
> > >>>> but
> > >>>>>> that
> > >>>>>> can be done as part of a different AIP
> > >>>>>
> > >>>>>
> > >>>>> You won't achieve your goal of "true" isolation of user code if you
> > >>>> don't
> > >>>>>> do triggerer. I think if the goal is to achieve it - it should
> cover
> > >>>>> both.
> > >>>>>
> > >>>>>
> > >>>>> My bad, should've explained our architecture for triggers as well,
> > >>>>> apologies. So here it is:
> > >>>>>
> > >>>>>
> > >>>>>   - Triggers would be running on a centralized service, so all the
> > >>>> Trigger
> > >>>>>   classes will be part of the platform team's repo and not the
> > >>>> customer's
> > >>>>> repo
> > >>>>>   - The triggers won't be able to use any libs other than std ones,
> > >>>> which
> > >>>>>   are being used in core Airflow (like requests, etc)
> > >>>>>   - As we are the owners of the core Airflow repo, customers have
> to
> > >>>> get
> > >>>>>   our approval to land any class in this path (unlike the dags repo
> > >>>> which
> > >>>>>   they own)
> > >>>>>   - When a customer's task defer, we would have an allowlist on our
> > >>>> side
> > >>>>>   to check if we should do the async polling or not
> > >>>>>   - If the Trigger class isn't part of our repo (allowlist), just
> > >>>> fail the
> > >>>>>   task, as anyway we won't be having the code that they used in the
> > >>>>> trigger
> > >>>>>   class
> > >>>>>   - If any of these conditions aren't suitable for you (as a
> > >>>> customer),
> > >>>>>   feel free to use sync tasks only
> > >>>>>
> > >>>>>
> > >>>>> But in general, I agree to make triggerer svc also communicate over
> > >>>> apis
> > >>>>> only. If that is done, then we can have instances of triggerer svc
> > >>>> running
> > >>>>> at customer's side as well, which can process any type of trigger
> > >>>> class.
> > >>>>> Though that's not a blocker for us at the moment, cause triggerer
> are
> > >>>>> mostly doing just polling using simple libs like requests.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Jul 25, 2025 at 5:03 PM Igor Kholopov
> > >>>> <[email protected]
> > >>>>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks Sumit for the detailed proposal. Overall I believe it
> aligns
> > >>>> well
> > >>>>>> with the goals of making Airflow well-scalable beyond a
> single-team
> > >>>>>> deployment (and AIP-85 goals), so you have my full support with
> this
> > >>>> one.
> > >>>>>>
> > >>>>>> I've left a couple of clarification requests on the AIP page.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Igor
> > >>>>>>
> > >>>>>> On Fri, Jul 25, 2025 at 11:50 AM Sumit Maheshwari <
> > >>>>> [email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Thanks Jarek and Ash, for the initial review. It's good to know
> > >>>> that
> > >>>>> the
> > >>>>>>> DAG processor has some preemptive measures in place to prevent
> > >>>> access
> > >>>>>>> to the DB. However, the main issue we are trying to solve is not
> to
> > >>>>>> provide
> > >>>>>>> DB creds to the customer teams, who are using Airflow as a
> > >>>> multi-tenant
> > >>>>>>> orchestration platform. I've updated the doc to reflect this
> point
> > >>>> as
> > >>>>>> well.
> > >>>>>>>
> > >>>>>>> Answering Jarek's points,
> > >>>>>>>
> > >>>>>>> #1. Yeah, had forgot to write about token mechanism, added that
> in
> > >>>> doc,
> > >>>>>> but
> > >>>>>>> still how the token can be obtained (safely) is still open in my
> > >>>> mind.
> > >>>>> I
> > >>>>>>> believe the token used by task executors can be created outside
> of
> > >>>> it
> > >>>>> as
> > >>>>>>> well (I may be wrong here).
> > >>>>>>>
> > >>>>>>> #2. Yeah, we would need something similar for triggerers as well,
> > >>>> but
> > >>>>>> that
> > >>>>>>> can be done as part of a different AIP
> > >>>>>>>
> > >>>>>>> #3. Yeah, I also believe the API should work largely.
> > >>>>>>>
> > >>>>>>> #4. Added that in the AIP, that instead of dag_dirs we can work
> > >>>> with
> > >>>>>>> dag_bundles and every dag-processor instance would be treated as
> a
> > >>>> diff
> > >>>>>>> bundle.
> > >>>>>>>
> > >>>>>>> Also, added points around callbacks, as these are also fetched
> > >>>> directly
> > >>>>>>> from the DB.
> > >>>>>>>
> > >>>>>>> On Fri, Jul 25, 2025 at 11:58 AM Jarek Potiuk <[email protected]>
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>>>> A clarification to this - the dag parser today is likely not
> > >>>>>> protection
> > >>>>>>>> against a dedicated malicious DAG author, but it does protect
> > >>>> against
> > >>>>>>>> casual DB access attempts - the db session is blanked out in the
> > >>>>>> parsing
> > >>>>>>>> process , as are the env var configs
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> >
> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
> > >>>>>>>> -
> > >>>>>>>> is this perfect no? but it’s much more than no protection
> > >>>>>>>> Oh absolutely.. This is exactly what we discussed back then in
> > >>>> March
> > >>>>> I
> > >>>>>>>> think - and the way we decided to go for 3.0 with full knowledge
> > >>>> it's
> > >>>>>> not
> > >>>>>>>> protecting against all threats.
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 25, 2025 at 8:22 AM Ash Berlin-Taylor <
> > >>>> [email protected]>
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> A clarification to this - the dag parser today is likely not
> > >>>>>> protection
> > >>>>>>>>> against a dedicated malicious DAG author, but it does protect
> > >>>>> against
> > >>>>>>>>> casual DB access attempts - the db session is blanked out in
> > >>>> the
> > >>>>>>> parsing
> > >>>>>>>>> process , as are the env var configs
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> >
> https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L274-L316
> > >>>>>>>>> - is this perfect no? but it’s much more than no protection
> > >>>>>>>>>
> > >>>>>>>>>> On 24 Jul 2025, at 21:56, Jarek Potiuk <[email protected]>
> > >>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Currently in the DagFile processor there is no  built-in
> > >>>>> protection
> > >>>>>>>>> against
> > >>>>>>>>>> user code from Dag Parsing to - for example - read database
> > >>>>>>>>>> credentials from airflow configuration and use them to talk
> > >>>> to DB
> > >>>>>>>>> directly.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> >
>

Re: [DISCUSS] AIP-92 Isolate DAG parsing logic

Reply via email to