+1 (or rather +10). There are two additional things - both related to
security and our new security model.

1) A separate, standalone DAG processor follows the "secure by design"
principle. Having the scheduler and the DAG file processor share the same
"process" space breaks the isolation between the DAG-author-controlled
security perimeter and the scheduler perimeter. Even though they are
separate processes, it is inherently unsafe (from the security model
perspective) to have the DAG processor started as a sub-process of the
scheduler. And this is not an "academic" case - we had a few security
issues reported to us that could only be exploited when Airflow was run
in the default mode; they were not exploitable when the standalone DAG
processor was used. This is independent of point 2) regarding database
access - just running in the same process space allows a DAG author to
impact running scheduler code in various ways (via temporary files, for
example - but there are multiple other scenarios and attack vectors).

2) Currently the DAG processor has very different requirements than the
scheduler when it comes to database access. Basically, it MUST NOT connect
to the Airflow meta-database. We already saw failures yesterday, after
merging bundle parsing, that appear to be caused by the connection being
set up in the scheduler and the DAG processor being forked from it via
multiprocessing. Originally we re-initialized the database when we forked
the processor, but now the DAG processor MUST NOT use the DB, so it looks
like we leak the DB connection to the processor. This is yet another
security issue - if the scheduler has a way to initialize the database, it
has access to the database credentials, and unless we involve some kind of
cgroups or docker-like process separation, a forked DAG processor (and
therefore the DAG author) can access those credentials and the database.
This pretty much means that the embedded DAG processor simply breaks the
"no DB access by DAG author" assumption of Airflow 3.
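
To make the inheritance problem in point 2) concrete, here is a minimal
sketch in plain Python. It is not Airflow code: `_STATE`, `configure`,
`DEMO_DB_URL` and both helper functions are hypothetical names made up for
this illustration, and it assumes a POSIX system where `os.fork` is
available. It only shows that a forked child starts with a copy of the
parent's memory, while a separately started process sees only what it is
explicitly given.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for credentials a scheduler-like parent holds in memory.
_STATE = {"db_url": None}


def configure(url):
    """Parent-side setup: keep the URL in memory and in the environment."""
    _STATE["db_url"] = url
    os.environ["DEMO_DB_URL"] = url


def forked_child_view():
    """Fork a child and return the DB URL it finds in its inherited memory."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: starts with a copy of the parent's address space
        try:
            os.close(read_fd)
            os.write(write_fd, str(_STATE["db_url"]).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        seen = reader.read()
    os.waitpid(pid, 0)
    return seen


def standalone_child_view():
    """Start a fresh interpreter without the variable and return what it sees."""
    env = {k: v for k, v in os.environ.items() if k != "DEMO_DB_URL"}
    code = "import os; print(os.environ.get('DEMO_DB_URL'))"
    done = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


if __name__ == "__main__":
    configure("postgresql://user:pw@host/db")
    print("forked child sees:    ", forked_child_view())
    print("standalone child sees:", standalone_child_view())
```

The standalone case here only demonstrates that the operator can choose
not to hand the credentials over at all; it is not a claim about how a
real deployment should be hardened.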

J.


On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> +1
>
> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham <shu...@amazon.com.invalid>
> wrote:
>
> > + 1 on this as well. From what I have seen, standalone DAG processing
> > results in a minor performance advantage and, importantly, makes the
> > Scheduler loop more resilient to DAG processor crashes.
> >
> > Shubham
> >
> > On 2025-01-09, 4:02 PM, "Daniel Imberman" <daniel.imber...@gmail.com>
> > wrote:
> >
> > I'm +1 on this.
> >
> >
> > The fact that there's one more thing to deploy isn't that big of an issue
> > given the number of pre-configurable options mentioned (e.g. helm) and a
> > full logical separation of DAG parsing and scheduling makes sense (one
> > thing that has been a longstanding issue with Airflow is the scheduler
> > "Doing too many things", so it would be nice to create a clean divide
> > here).
> >
> >
> > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham <j...@astronomer.io.invalid>
> > wrote:
> >
> >
> > > Hello everyone!
> > >
> > > As I've been working on parsing lately, I want to propose a change in
> > > that area in time for Airflow 3.
> > >
> > > Today there are 2 different ways the DAG processor can be run in Airflow -
> > > as a standalone component, or embedded in the scheduler. The standalone
> > > option came in 2.3, prior to that the only option was it being embedded in
> > > the scheduler.
> > >
> > > Why standalone? Generally speaking, parsing scales vertically (single loop
> > > - more concurrent parsing) while scheduling is scaled horizontally (many
> > > loops). As the DAG processor and scheduler scale in different manners, it's
> > > awkward to have them live in the same component. There is also a resiliency
> > > aspect here, no noisy neighbor issues.
> > >
> > > Really, the only positive of the embedded option is that it's easier to
> > > deploy, as there is 1 less component to worry about. However, we already
> > > have a number of components, so 1 more isn't that cumbersome. Everyone
> > > using breeze, standalone, the helm chart, a vendor, won't be impacted much
> > > by this change - in fact, having the log stream separate is a big positive!
> > >
> > > We'd also be able to remove a bit of complexity around reinitialising a
> > > bunch of stuff in the child process.
> > >
> > > Overall, I see primarily positives with this change, and a major version
> > > upgrade is the perfect time to simplify this part of Airflow. Thoughts?
> > >
> > > Jed
> > >
> >
> >
> >
> >
>
