Re: [DISCUSS] Drop support for the DAG processor embedded in the scheduler

Jed Cunningham Fri, 10 Jan 2025 13:12:14 -0800

Thanks everyone, lazy consensus here:
https://lists.apache.org/thread/cx1mg5g1pw4k1km56hzr06d0o39kdk8d


On Fri, Jan 10, 2025 at 2:08 PM Jed Cunningham <[email protected]> wrote:

> Small clarification: the DAG processor manager still has access to the db
> - but the DagFileProcessorProcess (eventually) doesn't, and uses the SDK
> machinery to communicate with the DagFileProcessorManager. The manager acts
> as its own "api server" in a way.
>
> Either way though, in the SQLite context, it's really no different running
> in the scheduler or alongside it. Any performance problems that would exist
> should already exist. And, I also don't think we should let performance
> issues for SQLite have too big of a say in what we do (unless it's
> unusable) - it is for development only after all.
>
> On Fri, Jan 10, 2025 at 1:41 PM Jens Scheffler <[email protected]>
> wrote:
>
>> +1 - as a very common problem in #user-troubleshooting is that DAGs are
>> vanishing because of poor DB performance or (mostly) poorly written
>> DAGs. If Performance problems appear in DAG parsing, separating the
>> parser is the first hand fix. So we should have this per default.
>> (Besides the case I hope at some point we come with the new DAG Bundle
>> to a point where performance problems in DAG parsing do not make DAGs
>> dis-appearing just because the re-parsing cycle is taking too long)
>>
>> On 10.01.25 17:54, Vincent Beck wrote:
>> > +1
>> >
>> > On 2025/01/10 15:50:55 Igor Kholopov wrote:
>> >> The only thing that we need to clear out before this is sealed - what
>> are
>> >> we going to do with SQLite? SQLite supports concurrent connections if
>> the
>> >> processes are on the same host and we already have WAL enabled by Ash
>> in
>> >> https://github.com/apache/airflow/pull/44839/files. But I think we
>> still
>> >> need to do some additional smoke-testing to confirm that SQLite
>> performance
>> >> doesn't degrade drastically with dag-processor and scheduler writing
>> to it
>> >> concurrently.
>> >>
>> >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka
>> <[email protected]>
>> >> wrote:
>> >>
>> >>> +1 on this
>> >>>
>> >>> For many reasons, which have already been brought up in the thread.
>> >>>
>> >>>
>> >>> On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov
>> <[email protected]
>> >>> wrote:
>> >>>
>> >>>> +1, there are a lot of old code paths that exist only because of the
>> >>>> embedding in the scheduler support. Focusing on a single supported
>> mode
>> >>> of
>> >>>> operation will allow us to significantly reduce the size (and
>> complexity)
>> >>>> of the DAG processing code.
>> >>>>
>> >>>> On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <
>> [email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> +1
>> >>>>>
>> >>>>> On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
>> >>>>> <[email protected]> wrote:
>> >>>>>
>> >>>>>> +1 - separating these workloads makes sense to me - we remove
>> >>>>>> unnecessary coupling and make them more single-responsibility,
>> which
>> >>>>> eases
>> >>>>>> reasoning about the system and any potential debugging
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <[email protected]>
>> >>>> wrote:
>> >>>>>>> Yeah, purely from operational perspective, debugging issues become
>> >>>> lots
>> >>>>>>> simpler if they are separated as one is CPU hungry while the other
>> >>> is
>> >>>>>>> memory hungry!
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <[email protected]>
>> >>> wrote:
>> >>>>>>>> +1 (or rather +10). There are two additional things - both
>> >>> related
>> >>>> to
>> >>>>>>>> security and our new security model.
>> >>>>>>>>
>> >>>>>>>> 1) separate standalone dag processor follows "secure by design"
>> >>>>>>> principle.
>> >>>>>>>> Having scheduler and dag file processor sharing the same
>> >>> "process"
>> >>>>>> space
>> >>>>>>> is
>> >>>>>>>> a problem with isolation of DAG Author controlled security
>> >>>> perimeter
>> >>>>>> and
>> >>>>>>>> scheduler perimeter. While they were separate processes, it's
>> >>> just
>> >>>>>>>> inherently unsafe (from the security model perspective) to have
>> >>> the
>> >>>>> DAG
>> >>>>>>>> processor started as a sub-process. And this is not an "academic"
>> >>>>> case
>> >>>>>> -
>> >>>>>>> we
>> >>>>>>>> had a few security issues reported to us that could be only
>> >>>> exploited
>> >>>>>> if
>> >>>>>>>> airflow was run in the default mode, the issues were not
>> >>>> exploitable
>> >>>>>> when
>> >>>>>>>> standalone DAG Processor was used. And that is independent from
>> >>>> point
>> >>>>>> 2)
>> >>>>>>>> regarding the database access - just running in the same process
>> >>>>> space
>> >>>>>>>> allows DAG author to impact running scheduler code in various
>> >>> ways
>> >>>>> (via
>> >>>>>>>> temporary files for example - but there are multiple other
>> >>>> scenarios
>> >>>>>> and
>> >>>>>>>> attack vectors).
>> >>>>>>>>
>> >>>>>>>> 2) Currently the DAG processor has very different requirements
>> >>> than
>> >>>>>>>> scheduler when it comes to database access. Basically it MUST NOT
>> >>>>>> connect
>> >>>>>>>> to the Airflow meta-database. We already saw failures yesterday
>> >>>> after
>> >>>>>>>> merging bundle parsing that suggest that it was caused by
>> >>>> connection
>> >>>>>>> being
>> >>>>>>>> set in scheduler and DAG processor forked from it via
>> >>>>> multiprocessing -
>> >>>>>>>> originally we re-initialized database when we forked the
>> >>> processor,
>> >>>>> but
>> >>>>>>> now
>> >>>>>>>> DAG processor MUST NOT use the DB, so basically it looks like we
>> >>>> leak
>> >>>>>> the
>> >>>>>>>> DB to the processor. Which is also yet another security issue -
>> >>> if
>> >>>>>>>> scheduler has a way to initialize the database, it means it has
>> >>>>> access
>> >>>>>> to
>> >>>>>>>> the database credentials, and it also means that unless we
>> >>> involve
>> >>>>> some
>> >>>>>>>> kind of cgroups docker-like process separation, such forked DAG
>> >>>>>> processor
>> >>>>>>>> (and this also means DAG author) can access those credentials,
>> >>> and
>> >>>>>> access
>> >>>>>>>> database. This means pretty much that embedded DAG processor
>> >>> simply
>> >>>>>>> breaks
>> >>>>>>>> the "no DB access by DAG author" assumptions of Airflow 3.
>> >>>>>>>>
>> >>>>>>>> J.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <[email protected]>
>> >>>>>> wrote:
>> >>>>>>>>> +1
>> >>>>>>>>>
>> >>>>>>>>> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
>> >>>>>> <[email protected]
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> + 1 on this as well. From what I have seen, standalone DAG
>> >>>>>> processing
>> >>>>>>>>>> results in a minor performance advantage and, importantly,
>> >>>> makes
>> >>>>>> the
>> >>>>>>>>>> Scheduler loop more resilient to DAG processor crashes.
>> >>>>>>>>>>
>> >>>>>>>>>> Shubham
>> >>>>>>>>>>
>> >>>>>>>>>> On 2025-01-09, 4:02 PM, "Daniel Imberman" <
>> >>>>>>> [email protected]
>> >>>>>>>>>> <mailto:[email protected]>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> CAUTION: This email originated from outside of the
>> >>>> organization.
>> >>>>> Do
>> >>>>>>> not
>> >>>>>>>>>> click links or open attachments unless you can confirm the
>> >>>> sender
>> >>>>>> and
>> >>>>>>>>> know
>> >>>>>>>>>> the content is safe.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> AVERTISSEMENT: Ce courrier électronique provient d’un
>> >>>> expéditeur
>> >>>>>>>> externe.
>> >>>>>>>>>> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si
>> >>>> vous
>> >>>>>> ne
>> >>>>>>>>> pouvez
>> >>>>>>>>>> pas confirmer l’identité de l’expéditeur et si vous n’êtes
>> >>> pas
>> >>>>>>> certain
>> >>>>>>>>> que
>> >>>>>>>>>> le contenu ne présente aucun risque.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I'm +1 on this.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> The fact that there's one more thing to deploy isn't that big
>> >>>> of
>> >>>>> an
>> >>>>>>>> issue
>> >>>>>>>>>> given the number of pre-configurable options mentioned (e.g.
>> >>>>> helm)
>> >>>>>>> and
>> >>>>>>>> a
>> >>>>>>>>>> full logical separation of DAG parsing and scheduling makes
>> >>>> sense
>> >>>>>>> (one
>> >>>>>>>>>> thing that has been a longstanding issue with Airflow is the
>> >>>>>>> scheduler
>> >>>>>>>>>> "Doing too many things", so it would be nice to create a
>> >>> clean
>> >>>>>> divide
>> >>>>>>>>>> here).
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
>> >>>>>> <[email protected]
>> >>>>>>>>>> <mailto:[email protected]>lid>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> Hello everyone!
>> >>>>>>>>>>>
>> >>>>>>>>>>> As I've been working on parsing lately, I want to propose a
>> >>>>>> change
>> >>>>>>> in
>> >>>>>>>>>> that
>> >>>>>>>>>>> area in time for Airflow 3.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Today there are 2 different ways the DAG processor can be
>> >>> run
>> >>>>> in
>> >>>>>>>>> Airflow
>> >>>>>>>>>> -
>> >>>>>>>>>>> as a standalone component, or embedded in the scheduler.
>> >>> The
>> >>>>>>>> standalone
>> >>>>>>>>>>> option came in 2.3, prior to that the only option was it
>> >>>> being
>> >>>>>>>> embedded
>> >>>>>>>>>> in
>> >>>>>>>>>>> the scheduler.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Why standalone? Generally speaking, parsing scales
>> >>> vertically
>> >>>>>>> (single
>> >>>>>>>>>> loop
>> >>>>>>>>>>> - more concurrent parsing) while scheduling is scaled
>> >>>>>> horizontally
>> >>>>>>>>> (many
>> >>>>>>>>>>> loops). As the DAG processor and scheduler scale in
>> >>> different
>> >>>>>>>> manners,
>> >>>>>>>>>> it's
>> >>>>>>>>>>> awkward to have them live in the same component. There is
>> >>>> also
>> >>>>> a
>> >>>>>>>>>> resiliency
>> >>>>>>>>>>> aspect here, no noisy neighbor issues.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Really, the only positive of the embedded option is that
>> >>> it's
>> >>>>>>> easier
>> >>>>>>>> to
>> >>>>>>>>>>> deploy, as there is 1 less component to worry about.
>> >>> However,
>> >>>>> we
>> >>>>>>>>> already
>> >>>>>>>>>>> have a number of components, so 1 more isn't that
>> >>> cumbersome.
>> >>>>>>>> Everyone
>> >>>>>>>>>>> using breeze, standalone, the helm chart, a vendor, won't
>> >>> be
>> >>>>>>> impacted
>> >>>>>>>>>> much
>> >>>>>>>>>>> by this change - in fact, having the log stream separate
>> >>> is a
>> >>>>> big
>> >>>>>>>>>> positive!
>> >>>>>>>>>>> We'd also be able to remove a bit of complexity around
>> >>>>>>>> reinitialising a
>> >>>>>>>>>>> bunch of stuff in the child process.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Overall, I see primarily positives with this change, and a
>> >>>>> major
>> >>>>>>>>> version
>> >>>>>>>>>>> upgrade is the perfect time to simplify this part of
>> >>> Airflow.
>> >>>>>>>> Thoughts?
>> >>>>>>>>>>> Jed
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

Re: [DISCUSS] Drop support for the DAG processor embedded in the scheduler

Reply via email to