Re: [DISCUSS] Drop support for the DAG processor embedded in the scheduler

Jens Scheffler Fri, 10 Jan 2025 12:40:42 -0800

+1 - as a very common problem in #user-troubleshooting is that DAGs are
vanishing because of poor DB performance or (mostly) poorly written
DAGs. If Performance problems appear in DAG parsing, separating the
parser is the first hand fix. So we should have this per default.
(Besides the case I hope at some point we come with the new DAG Bundle
to a point where performance problems in DAG parsing do not make DAGs
dis-appearing just because the re-parsing cycle is taking too long)


On 10.01.25 17:54, Vincent Beck wrote:

+1

On 2025/01/10 15:50:55 Igor Kholopov wrote:

The only thing that we need to clear out before this is sealed - what are
we going to do with SQLite? SQLite supports concurrent connections if the
processes are on the same host and we already have WAL enabled by Ash in
https://github.com/apache/airflow/pull/44839/files. But I think we still
need to do some additional smoke-testing to confirm that SQLite performance
doesn't degrade drastically with dag-processor and scheduler writing to it
concurrently.

On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <[email protected]>
wrote:

+1 on this

For many reasons, which have already been brought up in the thread.


On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov <[email protected]
wrote:

+1, there are a lot of old code paths that exist only because of the
embedding in the scheduler support. Focusing on a single supported mode

of

operation will allow us to significantly reduce the size (and complexity)
of the DAG processing code.

On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <[email protected]>
wrote:

+1

On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
<[email protected]> wrote:

+1 - separating these workloads makes sense to me - we remove
unnecessary coupling and make them more single-responsibility, which

eases

reasoning about the system and any potential debugging



On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <[email protected]>

wrote:

Yeah, purely from operational perspective, debugging issues become

lots

simpler if they are separated as one is CPU hungry while the other

is

memory hungry!



On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <[email protected]>

wrote:

+1 (or rather +10). There are two additional things - both

related

to

security and our new security model.

1) separate standalone dag processor follows "secure by design"

principle.

Having scheduler and dag file processor sharing the same

"process"

space

is

a problem with isolation of DAG Author controlled security

perimeter

and

scheduler perimeter. While they were separate processes, it's

just

inherently unsafe (from the security model perspective) to have

the

DAG

processor started as a sub-process. And this is not an "academic"

case

we

had a few security issues reported to us that could be only

exploited

if

airflow was run in the default mode, the issues were not

exploitable

when

standalone DAG Processor was used. And that is independent from

point

2)

regarding the database access - just running in the same process

space

allows DAG author to impact running scheduler code in various

ways

(via

temporary files for example - but there are multiple other

scenarios

and

attack vectors).

2) Currently the DAG processor has very different requirements

than

scheduler when it comes to database access. Basically it MUST NOT

connect

to the Airflow meta-database. We already saw failures yesterday

after

merging bundle parsing that suggest that it was caused by

connection

being

set in scheduler and DAG processor forked from it via

multiprocessing -

originally we re-initialized database when we forked the

processor,

but

now

DAG processor MUST NOT use the DB, so basically it looks like we

leak

the

DB to the processor. Which is also yet another security issue -

if

scheduler has a way to initialize the database, it means it has

access

to

the database credentials, and it also means that unless we

involve

some

kind of cgroups docker-like process separation, such forked DAG

processor

(and this also means DAG author) can access those credentials,

and

access

database. This means pretty much that embedded DAG processor

simply

breaks

the "no DB access by DAG author" assumptions of Airflow 3.

J.


On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <[email protected]>

wrote:

+1

On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham

<[email protected]

wrote:

+ 1 on this as well. From what I have seen, standalone DAG

processing

results in a minor performance advantage and, importantly,

makes

the

Scheduler loop more resilient to DAG processor crashes.

Shubham

On 2025-01-09, 4:02 PM, "Daniel Imberman" <

[email protected]

<mailto:[email protected]>> wrote:


CAUTION: This email originated from outside of the

organization.

Do

not

click links or open attachments unless you can confirm the

sender

and

know

the content is safe.






AVERTISSEMENT: Ce courrier électronique provient d’un

expéditeur

externe.

Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si

vous

ne

pouvez

pas confirmer l’identité de l’expéditeur et si vous n’êtes

pas

certain

que

le contenu ne présente aucun risque.






I'm +1 on this.


The fact that there's one more thing to deploy isn't that big

of

an

issue

given the number of pre-configurable options mentioned (e.g.

helm)

and

full logical separation of DAG parsing and scheduling makes

sense

(one

thing that has been a longstanding issue with Airflow is the

scheduler

"Doing too many things", so it would be nice to create a

clean

divide

here).


On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham

<[email protected]

<mailto:[email protected]>lid>
wrote:

Hello everyone!

As I've been working on parsing lately, I want to propose a

change

in

that

area in time for Airflow 3.

Today there are 2 different ways the DAG processor can be

run

in

Airflow

as a standalone component, or embedded in the scheduler.

The

standalone

option came in 2.3, prior to that the only option was it

being

embedded

in

the scheduler.

Why standalone? Generally speaking, parsing scales

vertically

(single

loop

- more concurrent parsing) while scheduling is scaled

horizontally

(many

loops). As the DAG processor and scheduler scale in

different

manners,

it's

awkward to have them live in the same component. There is

also

resiliency

aspect here, no noisy neighbor issues.

Really, the only positive of the embedded option is that

it's

easier

to

deploy, as there is 1 less component to worry about.

However,

we

already

have a number of components, so 1 more isn't that

cumbersome.

Everyone

using breeze, standalone, the helm chart, a vendor, won't

be

impacted

much

by this change - in fact, having the log stream separate

is a

big

positive!

We'd also be able to remove a bit of complexity around

reinitialising a

bunch of stuff in the child process.

Overall, I see primarily positives with this change, and a

major

version

upgrade is the perfect time to simplify this part of

Airflow.

Thoughts?

Jed

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [DISCUSS] Drop support for the DAG processor embedded in the scheduler

Reply via email to