Small clarification: the DAG processor manager still has access to the db -
but the DagFileProcessorProcess (eventually) doesn't, and uses the SDK
machinery to communicate with the DagFileProcessorManager. The manager acts
as its own "api server" in a way.

Either way though, in the SQLite context, it's really no different running
in the scheduler or alongside it. Any performance problems that would exist
should already exist. And, I also don't think we should let performance
issues for SQLite have too big of a say in what we do (unless it's
unusable) - it is for development only after all.

On Fri, Jan 10, 2025 at 1:41 PM Jens Scheffler <j_scheff...@gmx.de.invalid>
wrote:

> +1 - as a very common problem in #user-troubleshooting is that DAGs are
> vanishing because of poor DB performance or (mostly) poorly written
> DAGs. If Performance problems appear in DAG parsing, separating the
> parser is the first hand fix. So we should have this per default.
> (Besides the case I hope at some point we come with the new DAG Bundle
> to a point where performance problems in DAG parsing do not make DAGs
> dis-appearing just because the re-parsing cycle is taking too long)
>
> On 10.01.25 17:54, Vincent Beck wrote:
> > +1
> >
> > On 2025/01/10 15:50:55 Igor Kholopov wrote:
> >> The only thing that we need to clear out before this is sealed - what
> are
> >> we going to do with SQLite? SQLite supports concurrent connections if
> the
> >> processes are on the same host and we already have WAL enabled by Ash in
> >> https://github.com/apache/airflow/pull/44839/files. But I think we
> still
> >> need to do some additional smoke-testing to confirm that SQLite
> performance
> >> doesn't degrade drastically with dag-processor and scheduler writing to
> it
> >> concurrently.
> >>
> >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka
> <vik...@astronomer.io.invalid>
> >> wrote:
> >>
> >>> +1 on this
> >>>
> >>> For many reasons, which have already been brought up in the thread.
> >>>
> >>>
> >>> On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov
> <ikholo...@google.com.invalid
> >>> wrote:
> >>>
> >>>> +1, there are a lot of old code paths that exist only because of the
> >>>> embedding in the scheduler support. Focusing on a single supported
> mode
> >>> of
> >>>> operation will allow us to significantly reduce the size (and
> complexity)
> >>>> of the DAG processing code.
> >>>>
> >>>> On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <
> pierrejb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
> >>>>> <michalmod...@google.com.invalid> wrote:
> >>>>>
> >>>>>> +1 - separating these workloads makes sense to me - we remove
> >>>>>> unnecessary coupling and make them more single-responsibility, which
> >>>>> eases
> >>>>>> reasoning about the system and any potential debugging
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com>
> >>>> wrote:
> >>>>>>> Yeah, purely from operational perspective, debugging issues become
> >>>> lots
> >>>>>>> simpler if they are separated as one is CPU hungry while the other
> >>> is
> >>>>>>> memory hungry!
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com>
> >>> wrote:
> >>>>>>>> +1 (or rather +10). There are two additional things - both
> >>> related
> >>>> to
> >>>>>>>> security and our new security model.
> >>>>>>>>
> >>>>>>>> 1) separate standalone dag processor follows "secure by design"
> >>>>>>> principle.
> >>>>>>>> Having scheduler and dag file processor sharing the same
> >>> "process"
> >>>>>> space
> >>>>>>> is
> >>>>>>>> a problem with isolation of DAG Author controlled security
> >>>> perimeter
> >>>>>> and
> >>>>>>>> scheduler perimeter. While they were separate processes, it's
> >>> just
> >>>>>>>> inherently unsafe (from the security model perspective) to have
> >>> the
> >>>>> DAG
> >>>>>>>> processor started as a sub-process. And this is not an "academic"
> >>>>> case
> >>>>>> -
> >>>>>>> we
> >>>>>>>> had a few security issues reported to us that could be only
> >>>> exploited
> >>>>>> if
> >>>>>>>> airflow was run in the default mode, the issues were not
> >>>> exploitable
> >>>>>> when
> >>>>>>>> standalone DAG Processor was used. And that is independent from
> >>>> point
> >>>>>> 2)
> >>>>>>>> regarding the database access - just running in the same process
> >>>>> space
> >>>>>>>> allows DAG author to impact running scheduler code in various
> >>> ways
> >>>>> (via
> >>>>>>>> temporary files for example - but there are multiple other
> >>>> scenarios
> >>>>>> and
> >>>>>>>> attack vectors).
> >>>>>>>>
> >>>>>>>> 2) Currently the DAG processor has very different requirements
> >>> than
> >>>>>>>> scheduler when it comes to database access. Basically it MUST NOT
> >>>>>> connect
> >>>>>>>> to the Airflow meta-database. We already saw failures yesterday
> >>>> after
> >>>>>>>> merging bundle parsing that suggest that it was caused by
> >>>> connection
> >>>>>>> being
> >>>>>>>> set in scheduler and DAG processor forked from it via
> >>>>> multiprocessing -
> >>>>>>>> originally we re-initialized database when we forked the
> >>> processor,
> >>>>> but
> >>>>>>> now
> >>>>>>>> DAG processor MUST NOT use the DB, so basically it looks like we
> >>>> leak
> >>>>>> the
> >>>>>>>> DB to the processor. Which is also yet another security issue -
> >>> if
> >>>>>>>> scheduler has a way to initialize the database, it means it has
> >>>>> access
> >>>>>> to
> >>>>>>>> the database credentials, and it also means that unless we
> >>> involve
> >>>>> some
> >>>>>>>> kind of cgroups docker-like process separation, such forked DAG
> >>>>>> processor
> >>>>>>>> (and this also means DAG author) can access those credentials,
> >>> and
> >>>>>> access
> >>>>>>>> database. This means pretty much that embedded DAG processor
> >>> simply
> >>>>>>> breaks
> >>>>>>>> the "no DB access by DAG author" assumptions of Airflow 3.
> >>>>>>>>
> >>>>>>>> J.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
> >>>>>> <shu...@amazon.com.invalid
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> + 1 on this as well. From what I have seen, standalone DAG
> >>>>>> processing
> >>>>>>>>>> results in a minor performance advantage and, importantly,
> >>>> makes
> >>>>>> the
> >>>>>>>>>> Scheduler loop more resilient to DAG processor crashes.
> >>>>>>>>>>
> >>>>>>>>>> Shubham
> >>>>>>>>>>
> >>>>>>>>>> On 2025-01-09, 4:02 PM, "Daniel Imberman" <
> >>>>>>> daniel.imber...@gmail.com
> >>>>>>>>>> <mailto:daniel.imber...@gmail.com>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> CAUTION: This email originated from outside of the
> >>>> organization.
> >>>>> Do
> >>>>>>> not
> >>>>>>>>>> click links or open attachments unless you can confirm the
> >>>> sender
> >>>>>> and
> >>>>>>>>> know
> >>>>>>>>>> the content is safe.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> AVERTISSEMENT: Ce courrier électronique provient d’un
> >>>> expéditeur
> >>>>>>>> externe.
> >>>>>>>>>> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si
> >>>> vous
> >>>>>> ne
> >>>>>>>>> pouvez
> >>>>>>>>>> pas confirmer l’identité de l’expéditeur et si vous n’êtes
> >>> pas
> >>>>>>> certain
> >>>>>>>>> que
> >>>>>>>>>> le contenu ne présente aucun risque.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'm +1 on this.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The fact that there's one more thing to deploy isn't that big
> >>>> of
> >>>>> an
> >>>>>>>> issue
> >>>>>>>>>> given the number of pre-configurable options mentioned (e.g.
> >>>>> helm)
> >>>>>>> and
> >>>>>>>> a
> >>>>>>>>>> full logical separation of DAG parsing and scheduling makes
> >>>> sense
> >>>>>>> (one
> >>>>>>>>>> thing that has been a longstanding issue with Airflow is the
> >>>>>>> scheduler
> >>>>>>>>>> "Doing too many things", so it would be nice to create a
> >>> clean
> >>>>>> divide
> >>>>>>>>>> here).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
> >>>>>> <j...@astronomer.io.inva
> >>>>>>>>>> <mailto:j...@astronomer.io.inva>lid>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Hello everyone!
> >>>>>>>>>>>
> >>>>>>>>>>> As I've been working on parsing lately, I want to propose a
> >>>>>> change
> >>>>>>> in
> >>>>>>>>>> that
> >>>>>>>>>>> area in time for Airflow 3.
> >>>>>>>>>>>
> >>>>>>>>>>> Today there are 2 different ways the DAG processor can be
> >>> run
> >>>>> in
> >>>>>>>>> Airflow
> >>>>>>>>>> -
> >>>>>>>>>>> as a standalone component, or embedded in the scheduler.
> >>> The
> >>>>>>>> standalone
> >>>>>>>>>>> option came in 2.3, prior to that the only option was it
> >>>> being
> >>>>>>>> embedded
> >>>>>>>>>> in
> >>>>>>>>>>> the scheduler.
> >>>>>>>>>>>
> >>>>>>>>>>> Why standalone? Generally speaking, parsing scales
> >>> vertically
> >>>>>>> (single
> >>>>>>>>>> loop
> >>>>>>>>>>> - more concurrent parsing) while scheduling is scaled
> >>>>>> horizontally
> >>>>>>>>> (many
> >>>>>>>>>>> loops). As the DAG processor and scheduler scale in
> >>> different
> >>>>>>>> manners,
> >>>>>>>>>> it's
> >>>>>>>>>>> awkward to have them live in the same component. There is
> >>>> also
> >>>>> a
> >>>>>>>>>> resiliency
> >>>>>>>>>>> aspect here, no noisy neighbor issues.
> >>>>>>>>>>>
> >>>>>>>>>>> Really, the only positive of the embedded option is that
> >>> it's
> >>>>>>> easier
> >>>>>>>> to
> >>>>>>>>>>> deploy, as there is 1 less component to worry about.
> >>> However,
> >>>>> we
> >>>>>>>>> already
> >>>>>>>>>>> have a number of components, so 1 more isn't that
> >>> cumbersome.
> >>>>>>>> Everyone
> >>>>>>>>>>> using breeze, standalone, the helm chart, a vendor, won't
> >>> be
> >>>>>>> impacted
> >>>>>>>>>> much
> >>>>>>>>>>> by this change - in fact, having the log stream separate
> >>> is a
> >>>>> big
> >>>>>>>>>> positive!
> >>>>>>>>>>> We'd also be able to remove a bit of complexity around
> >>>>>>>> reinitialising a
> >>>>>>>>>>> bunch of stuff in the child process.
> >>>>>>>>>>>
> >>>>>>>>>>> Overall, I see primarily positives with this change, and a
> >>>>> major
> >>>>>>>>> version
> >>>>>>>>>>> upgrade is the perfect time to simplify this part of
> >>> Airflow.
> >>>>>>>> Thoughts?
> >>>>>>>>>>> Jed
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > For additional commands, e-mail: dev-h...@airflow.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>

Reply via email to