+1 - as a very common problem in #user-troubleshooting is that DAGs are
vanishing because of poor DB performance or (mostly) poorly written
DAGs. If Performance problems appear in DAG parsing, separating the
parser is the first hand fix. So we should have this per default.
(Besides the case I hope at some point we come with the new DAG Bundle
to a point where performance problems in DAG parsing do not make DAGs
dis-appearing just because the re-parsing cycle is taking too long)

On 10.01.25 17:54, Vincent Beck wrote:
+1

On 2025/01/10 15:50:55 Igor Kholopov wrote:
The only thing that we need to clear out before this is sealed - what are
we going to do with SQLite? SQLite supports concurrent connections if the
processes are on the same host and we already have WAL enabled by Ash in
https://github.com/apache/airflow/pull/44839/files. But I think we still
need to do some additional smoke-testing to confirm that SQLite performance
doesn't degrade drastically with dag-processor and scheduler writing to it
concurrently.

On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <vik...@astronomer.io.invalid>
wrote:

+1 on this

For many reasons, which have already been brought up in the thread.


On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov <ikholo...@google.com.invalid
wrote:

+1, there are a lot of old code paths that exist only because of the
embedding in the scheduler support. Focusing on a single supported mode
of
operation will allow us to significantly reduce the size (and complexity)
of the DAG processing code.

On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <pierrejb...@gmail.com>
wrote:

+1

On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
<michalmod...@google.com.invalid> wrote:

+1 - separating these workloads makes sense to me - we remove
unnecessary coupling and make them more single-responsibility, which
eases
reasoning about the system and any potential debugging



On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com>
wrote:
Yeah, purely from operational perspective, debugging issues become
lots
simpler if they are separated as one is CPU hungry while the other
is
memory hungry!



On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com>
wrote:
+1 (or rather +10). There are two additional things - both
related
to
security and our new security model.

1) separate standalone dag processor follows "secure by design"
principle.
Having scheduler and dag file processor sharing the same
"process"
space
is
a problem with isolation of DAG Author controlled security
perimeter
and
scheduler perimeter. While they were separate processes, it's
just
inherently unsafe (from the security model perspective) to have
the
DAG
processor started as a sub-process. And this is not an "academic"
case
-
we
had a few security issues reported to us that could be only
exploited
if
airflow was run in the default mode, the issues were not
exploitable
when
standalone DAG Processor was used. And that is independent from
point
2)
regarding the database access - just running in the same process
space
allows DAG author to impact running scheduler code in various
ways
(via
temporary files for example - but there are multiple other
scenarios
and
attack vectors).

2) Currently the DAG processor has very different requirements
than
scheduler when it comes to database access. Basically it MUST NOT
connect
to the Airflow meta-database. We already saw failures yesterday
after
merging bundle parsing that suggest that it was caused by
connection
being
set in scheduler and DAG processor forked from it via
multiprocessing -
originally we re-initialized database when we forked the
processor,
but
now
DAG processor MUST NOT use the DB, so basically it looks like we
leak
the
DB to the processor. Which is also yet another security issue -
if
scheduler has a way to initialize the database, it means it has
access
to
the database credentials, and it also means that unless we
involve
some
kind of cgroups docker-like process separation, such forked DAG
processor
(and this also means DAG author) can access those credentials,
and
access
database. This means pretty much that embedded DAG processor
simply
breaks
the "no DB access by DAG author" assumptions of Airflow 3.

J.


On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com>
wrote:
+1

On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
<shu...@amazon.com.invalid
wrote:

+ 1 on this as well. From what I have seen, standalone DAG
processing
results in a minor performance advantage and, importantly,
makes
the
Scheduler loop more resilient to DAG processor crashes.

Shubham

On 2025-01-09, 4:02 PM, "Daniel Imberman" <
daniel.imber...@gmail.com
<mailto:daniel.imber...@gmail.com>> wrote:


CAUTION: This email originated from outside of the
organization.
Do
not
click links or open attachments unless you can confirm the
sender
and
know
the content is safe.






AVERTISSEMENT: Ce courrier électronique provient d’un
expéditeur
externe.
Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si
vous
ne
pouvez
pas confirmer l’identité de l’expéditeur et si vous n’êtes
pas
certain
que
le contenu ne présente aucun risque.






I'm +1 on this.


The fact that there's one more thing to deploy isn't that big
of
an
issue
given the number of pre-configurable options mentioned (e.g.
helm)
and
a
full logical separation of DAG parsing and scheduling makes
sense
(one
thing that has been a longstanding issue with Airflow is the
scheduler
"Doing too many things", so it would be nice to create a
clean
divide
here).


On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
<j...@astronomer.io.inva
<mailto:j...@astronomer.io.inva>lid>
wrote:


Hello everyone!

As I've been working on parsing lately, I want to propose a
change
in
that
area in time for Airflow 3.

Today there are 2 different ways the DAG processor can be
run
in
Airflow
-
as a standalone component, or embedded in the scheduler.
The
standalone
option came in 2.3, prior to that the only option was it
being
embedded
in
the scheduler.

Why standalone? Generally speaking, parsing scales
vertically
(single
loop
- more concurrent parsing) while scheduling is scaled
horizontally
(many
loops). As the DAG processor and scheduler scale in
different
manners,
it's
awkward to have them live in the same component. There is
also
a
resiliency
aspect here, no noisy neighbor issues.

Really, the only positive of the embedded option is that
it's
easier
to
deploy, as there is 1 less component to worry about.
However,
we
already
have a number of components, so 1 more isn't that
cumbersome.
Everyone
using breeze, standalone, the helm chart, a vendor, won't
be
impacted
much
by this change - in fact, having the log stream separate
is a
big
positive!
We'd also be able to remove a bit of complexity around
reinitialising a
bunch of stuff in the child process.

Overall, I see primarily positives with this change, and a
major
version
upgrade is the perfect time to simplify this part of
Airflow.
Thoughts?
Jed




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org

Reply via email to