In general - yes. That's something we discussed as a next step and "final"
step of true separation of "user code" from "airflow code".

Currently, in the DAG file processor there is no built-in protection preventing
user code executed during DAG parsing from - for example - reading database
credentials from the Airflow configuration and using them to talk to the DB directly.
I raised the same questions at the dev calls some time ago when we discussed
task isolation. From the discussion on one of the dev calls, this is what we
agreed we should have eventually, and it was a conscious choice (for now) to
make sure that the DB is not actually used in the parsing process - parsed,
serialized DAGs are still passed through multiprocessing to a process that has
DB access. The deliberate choice was that 3.0 would only isolate the "workers"
and nothing else.
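To illustrate the flow described above - this is not the actual Airflow code, just a toy, stdlib-only sketch of the boundary: user code runs in a child process that holds no DB credentials, and only the serialized result crosses back over multiprocessing to the parent, which is the only side that would touch the DB:

```python
import json
import multiprocessing


def parse_dag_file(conn) -> None:
    """Runs in the parsing subprocess: executes user code, no DB credentials here."""
    # Stand-in for actually importing the DAG file and serializing the DAG.
    serialized_dag = {"dag_id": "example", "tasks": ["t1", "t2"]}
    conn.send(json.dumps(serialized_dag))
    conn.close()


def collect_parsed_dag() -> dict:
    """Runs in the parent process: the only place that would talk to the DB."""
    ctx = multiprocessing.get_context("fork")  # "fork" keeps the sketch simple; not Windows-portable
    parent_conn, child_conn = ctx.Pipe()
    worker = ctx.Process(target=parse_dag_file, args=(child_conn,))
    worker.start()
    payload = parent_conn.recv()  # only serialized data crosses the process boundary
    worker.join()
    return json.loads(payload)   # here the parent would UPDATE the serialized dag table
```

The proposal in the thread essentially replaces that last step - the parent writing to the DB - with a POST to the api-server.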

This will also make the multi-team setup more isolated. And especially
if you would like to take it on, that would be cool :)

I would love to first hear feedback on the general concept without diving
into detailed comments, but my first few - very crude - comments are:

1) there should be a way to authenticate the Dag Processor to the API
server (long-lived auth information). Currently the Dag Processor uses DB
credentials to write to the DB (which by definition is not isolated) - but
when we open it up as an API, we should not allow "anyone" to post new
serialized DAGs, so there should be a mechanism to authenticate the "dag
processor". I don't have a concrete proposal yet - but we need to figure it out

2) I am afraid the Dag Processor is not enough; we also need to make a very
similar split in the Triggerer. The Triggerer is another component that has
access to both user code and the database. This means a "/triggerer" namespace
as well. That API should be used to update/retrieve serialized triggers in the
Airflow DB.

3) I don't think performance is a "huge" issue. For POST: currently the DAG
processor performs an UPDATE on the serialized dag table with the same payload.
The worst case is that we need more api-servers to handle it, because
instead of dag processor -> DB, the payload will be sent via the api-server.
It could even be a dedicated api-server if need be. Yes, it will likely require
a bit more resources - but I think it's at most ~2x the DAG processor memory in
the worst case, which is a "fixed" increase (and we could likely keep the
option of **not** using the API server in simple installations - more resources
would be needed only when "more isolation" is needed). I don't think we will
have a problem with pagination of GET - there is absolutely no need to
retrieve the full serialized DAG in the GET method. We only need metadata,
version information, and a hash of the serialized DAG - but not the serialized
DAG itself. That should be rather small, especially given point 4).
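A rough illustration of why the GET side stays small (the field names here are my own assumptions, not a proposed schema): the listing endpoint would return only metadata plus a digest of the serialized DAG, so the per-DAG payload is a few hundred bytes regardless of DAG size, and the DAG processor can compare hashes to decide whether a POST is needed at all:

```python
import hashlib
import json


def dag_metadata(dag_id: str, bundle: str, version: int, serialized_dag: dict) -> dict:
    """Build the metadata-only record a GET listing could return.

    The full serialized DAG is hashed but never included in the response.
    """
    blob = json.dumps(serialized_dag, sort_keys=True).encode()
    return {
        "dag_id": dag_id,
        "dag_bundle": bundle,  # ties in with point 4)
        "version": version,
        "dag_hash": hashlib.sha256(blob).hexdigest(),
    }
```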

4) the API should have "dag_bundle" (or a set of dag_bundles) as the primary
"selection" criterion. That would also play very well with multi-team,
where you should have a separate DAG processor for the group of dag_bundles
belonging to one team. This also means that the authentication in point 1)
should cover access to specific dag bundles only - for example, I
imagine we could generate long-lived private credentials containing
a signed list of the dag bundles that the DAG processor is allowed to
interact with.
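One possible shape of such a credential - purely a sketch of the idea, with a server-side HMAC standing in for whatever signing scheme we would actually pick (it could as well be a JWT): the api-server signs the list of allowed bundles once, the DAG processor presents the token on every request, and the server only has to verify the signature and check the bundle:

```python
import base64
import hashlib
import hmac
import json

# Assumption for the sketch: the api-server holds a signing key; real key
# management (rotation, storage) is out of scope here.
SERVER_SECRET = b"replace-with-real-key-management"


def issue_token(allowed_bundles: list[str]) -> str:
    """Create a long-lived credential scoping a DAG processor to specific bundles."""
    claims = json.dumps({"dag_bundles": sorted(allowed_bundles)}).encode()
    sig = hmac.new(SERVER_SECRET, claims, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(claims).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def bundle_allowed(token: str, bundle: str) -> bool:
    """Verify the signature, then check the request targets an allowed bundle."""
    claims_b64, sig_b64 = token.split(".")
    claims = base64.urlsafe_b64decode(claims_b64)
    expected = hmac.new(SERVER_SECRET, claims, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        return False  # tampered or foreign token
    return bundle in json.loads(claims)["dag_bundles"]
```

The nice property is that the check is stateless on the api-server side, which matters if we end up running dedicated api-servers per team as in point 3).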

J.




On Thu, Jul 24, 2025 at 7:30 PM Sumit Maheshwari <sumeet.ma...@gmail.com>
wrote:

> Hello everyone,
>
> I have created a draft version to separate the running of DAG processor
> from Airflow's core services and moved it closer to the execution side.
>
> Please review the proposal and provide your valuable feedback.
>
> PS: We are in the process of adopting Airflow3 from some in-house setup, so
> I might not have a full understanding of the latest Airflow concepts and
> some other nitty-gritties.
>
> Thanks,
> Sumit
>
