BIG +1 to this proposal!

________________________________
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Friday, January 10, 2025 8:54:50 AM
To: dev@airflow.apache.org
Subject: RE: [EXT] [DISCUSS] Drop support for the DAG processor embedded in the 
scheduler

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.



AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. Ne 
cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez pas 
confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que le 
contenu ne présente aucun risque.



With Task API :)

On Fri, Jan 10, 2025 at 5:54 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Dag processor does not access the DB at all - it communicates with Task
> SDK to write the serialized DAGS.
>
> On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov <ikholo...@google.com.invalid>
> wrote:
>
>> The only thing that we need to clear out before this is sealed - what are
>> we going to do with SQLite? SQLite supports concurrent connections if the
>> processes are on the same host and we already have WAL enabled by Ash in
>> https://github.com/apache/airflow/pull/44839/files. But I think we still
>> need to do some additional smoke-testing to confirm that SQLite
>> performance
>> doesn't degrade drastically with dag-processor and scheduler writing to it
>> concurrently.
>>
>> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <vik...@astronomer.io.invalid
>> >
>> wrote:
>>
>> > +1 on this
>> >
>> > For many reasons, which have already been brought up in the thread.
>> >
>> >
>> > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov
>> <ikholo...@google.com.invalid
>> > >
>> > wrote:
>> >
>> > > +1, there are a lot of old code paths that exist only because of the
>> > > embedding in the scheduler support. Focusing on a single supported
>> mode
>> > of
>> > > operation will allow us to significantly reduce the size (and
>> complexity)
>> > > of the DAG processing code.
>> > >
>> > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <
>> pierrejb...@gmail.com>
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
>> > > > <michalmod...@google.com.invalid> wrote:
>> > > >
>> > > > > +1 - separating these workloads makes sense to me - we remove
>> > > > > unnecessary coupling and make them more single-responsibility,
>> which
>> > > > eases
>> > > > > reasoning about the system and any potential debugging
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Yeah, purely from operational perspective, debugging issues
>> become
>> > > lots
>> > > > > > simpler if they are separated as one is CPU hungry while the
>> other
>> > is
>> > > > > > memory hungry!
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com>
>> > wrote:
>> > > > > >
>> > > > > > > +1 (or rather +10). There are two additional things - both
>> > related
>> > > to
>> > > > > > > security and our new security model.
>> > > > > > >
>> > > > > > > 1) separate standalone dag processor follows "secure by
>> design"
>> > > > > > principle.
>> > > > > > > Having scheduler and dag file processor sharing the same
>> > "process"
>> > > > > space
>> > > > > > is
>> > > > > > > a problem with isolation of DAG Author controlled security
>> > > perimeter
>> > > > > and
>> > > > > > > scheduler perimeter. While they were separate processes, it's
>> > just
>> > > > > > > inherently unsafe (from the security model perspective) to
>> have
>> > the
>> > > > DAG
>> > > > > > > processor started as a sub-process. And this is not an
>> "academic"
>> > > > case
>> > > > > -
>> > > > > > we
>> > > > > > > had a few security issues reported to us that could be only
>> > > exploited
>> > > > > if
>> > > > > > > airflow was run in the default mode, the issues were not
>> > > exploitable
>> > > > > when
>> > > > > > > standalone DAG Processor was used. And that is independent
>> from
>> > > point
>> > > > > 2)
>> > > > > > > regarding the database access - just running in the same
>> process
>> > > > space
>> > > > > > > allows DAG author to impact running scheduler code in various
>> > ways
>> > > > (via
>> > > > > > > temporary files for example - but there are multiple other
>> > > scenarios
>> > > > > and
>> > > > > > > attack vectors).
>> > > > > > >
>> > > > > > > 2) Currently the DAG processor has very different requirements
>> > than
>> > > > > > > scheduler when it comes to database access. Basically it MUST
>> NOT
>> > > > > connect
>> > > > > > > to the Airflow meta-database. We already saw failures
>> yesterday
>> > > after
>> > > > > > > merging bundle parsing that suggest that it was caused by
>> > > connection
>> > > > > > being
>> > > > > > > set in scheduler and DAG processor forked from it via
>> > > > multiprocessing -
>> > > > > > > originally we re-initialized database when we forked the
>> > processor,
>> > > > but
>> > > > > > now
>> > > > > > > DAG processor MUST NOT use the DB, so basically it looks like
>> we
>> > > leak
>> > > > > the
>> > > > > > > DB to the processor. Which is also yet another security issue
>> -
>> > if
>> > > > > > > scheduler has a way to initialize the database, it means it
>> has
>> > > > access
>> > > > > to
>> > > > > > > the database credentials, and it also means that unless we
>> > involve
>> > > > some
>> > > > > > > kind of cgroups docker-like process separation, such forked
>> DAG
>> > > > > processor
>> > > > > > > (and this also means DAG author) can access those credentials,
>> > and
>> > > > > access
>> > > > > > > database. This means pretty much that embedded DAG processor
>> > simply
>> > > > > > breaks
>> > > > > > > the "no DB access by DAG author" assumptions of Airflow 3.
>> > > > > > >
>> > > > > > > J.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <
>> kaxiln...@gmail.com>
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > +1
>> > > > > > > >
>> > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
>> > > > > <shu...@amazon.com.invalid
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > + 1 on this as well. From what I have seen, standalone DAG
>> > > > > processing
>> > > > > > > > > results in a minor performance advantage and, importantly,
>> > > makes
>> > > > > the
>> > > > > > > > > Scheduler loop more resilient to DAG processor crashes.
>> > > > > > > > >
>> > > > > > > > > Shubham
>> > > > > > > > >
>> > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" <
>> > > > > > daniel.imber...@gmail.com
>> > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > CAUTION: This email originated from outside of the
>> > > organization.
>> > > > Do
>> > > > > > not
>> > > > > > > > > click links or open attachments unless you can confirm the
>> > > sender
>> > > > > and
>> > > > > > > > know
>> > > > > > > > > the content is safe.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un
>> > > expéditeur
>> > > > > > > externe.
>> > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe
>> si
>> > > vous
>> > > > > ne
>> > > > > > > > pouvez
>> > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes
>> > pas
>> > > > > > certain
>> > > > > > > > que
>> > > > > > > > > le contenu ne présente aucun risque.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm +1 on this.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > The fact that there's one more thing to deploy isn't that
>> big
>> > > of
>> > > > an
>> > > > > > > issue
>> > > > > > > > > given the number of pre-configurable options mentioned
>> (e.g.
>> > > > helm)
>> > > > > > and
>> > > > > > > a
>> > > > > > > > > full logical separation of DAG parsing and scheduling
>> makes
>> > > sense
>> > > > > > (one
>> > > > > > > > > thing that has been a longstanding issue with Airflow is
>> the
>> > > > > > scheduler
>> > > > > > > > > "Doing too many things", so it would be nice to create a
>> > clean
>> > > > > divide
>> > > > > > > > > here).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
>> > > > > <j...@astronomer.io.inva
>> > > > > > > > > <mailto:j...@astronomer.io.inva>lid>
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Hello everyone!
>> > > > > > > > > >
>> > > > > > > > > > As I've been working on parsing lately, I want to
>> propose a
>> > > > > change
>> > > > > > in
>> > > > > > > > > that
>> > > > > > > > > > area in time for Airflow 3.
>> > > > > > > > > >
>> > > > > > > > > > Today there are 2 different ways the DAG processor can
>> be
>> > run
>> > > > in
>> > > > > > > > Airflow
>> > > > > > > > > -
>> > > > > > > > > > as a standalone component, or embedded in the scheduler.
>> > The
>> > > > > > > standalone
>> > > > > > > > > > option came in 2.3, prior to that the only option was it
>> > > being
>> > > > > > > embedded
>> > > > > > > > > in
>> > > > > > > > > > the scheduler.
>> > > > > > > > > >
>> > > > > > > > > > Why standalone? Generally speaking, parsing scales
>> > vertically
>> > > > > > (single
>> > > > > > > > > loop
>> > > > > > > > > > - more concurrent parsing) while scheduling is scaled
>> > > > > horizontally
>> > > > > > > > (many
>> > > > > > > > > > loops). As the DAG processor and scheduler scale in
>> > different
>> > > > > > > manners,
>> > > > > > > > > it's
>> > > > > > > > > > awkward to have them live in the same component. There
>> is
>> > > also
>> > > > a
>> > > > > > > > > resiliency
>> > > > > > > > > > aspect here, no noisy neighbor issues.
>> > > > > > > > > >
>> > > > > > > > > > Really, the only positive of the embedded option is that
>> > it's
>> > > > > > easier
>> > > > > > > to
>> > > > > > > > > > deploy, as there is 1 less component to worry about.
>> > However,
>> > > > we
>> > > > > > > > already
>> > > > > > > > > > have a number of components, so 1 more isn't that
>> > cumbersome.
>> > > > > > > Everyone
>> > > > > > > > > > using breeze, standalone, the helm chart, a vendor,
>> won't
>> > be
>> > > > > > impacted
>> > > > > > > > > much
>> > > > > > > > > > by this change - in fact, having the log stream separate
>> > is a
>> > > > big
>> > > > > > > > > positive!
>> > > > > > > > > >
>> > > > > > > > > > We'd also be able to remove a bit of complexity around
>> > > > > > > reinitialising a
>> > > > > > > > > > bunch of stuff in the child process.
>> > > > > > > > > >
>> > > > > > > > > > Overall, I see primarily positives with this change,
>> and a
>> > > > major
>> > > > > > > > version
>> > > > > > > > > > upgrade is the perfect time to simplify this part of
>> > Airflow.
>> > > > > > > Thoughts?
>> > > > > > > > > >
>> > > > > > > > > > Jed
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to