Small clarification: the DAG processor manager still has access to the db - but the DagFileProcessorProcess (eventually) doesn't, and uses the SDK machinery to communicate with the DagFileProcessorManager. The manager acts as its own "api server" in a way.
Either way though, in the SQLite context, it's really no different running in the scheduler or alongside it. Any performance problems that would exist should already exist. And, I also don't think we should let performance issues for SQLite have too big of a say in what we do (unless it's unusable) - it is for development only after all. On Fri, Jan 10, 2025 at 1:41 PM Jens Scheffler <j_scheff...@gmx.de.invalid> wrote: > +1 - as a very common problem in #user-troubleshooting is that DAGs are > vanishing because of poor DB performance or (mostly) poorly written > DAGs. If Performance problems appear in DAG parsing, separating the > parser is the first hand fix. So we should have this per default. > (Besides the case I hope at some point we come with the new DAG Bundle > to a point where performance problems in DAG parsing do not make DAGs > dis-appearing just because the re-parsing cycle is taking too long) > > On 10.01.25 17:54, Vincent Beck wrote: > > +1 > > > > On 2025/01/10 15:50:55 Igor Kholopov wrote: > >> The only thing that we need to clear out before this is sealed - what > are > >> we going to do with SQLite? SQLite supports concurrent connections if > the > >> processes are on the same host and we already have WAL enabled by Ash in > >> https://github.com/apache/airflow/pull/44839/files. But I think we > still > >> need to do some additional smoke-testing to confirm that SQLite > performance > >> doesn't degrade drastically with dag-processor and scheduler writing to > it > >> concurrently. > >> > >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka > <vik...@astronomer.io.invalid> > >> wrote: > >> > >>> +1 on this > >>> > >>> For many reasons, which have already been brought up in the thread. > >>> > >>> > >>> On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov > <ikholo...@google.com.invalid > >>> wrote: > >>> > >>>> +1, there are a lot of old code paths that exist only because of the > >>>> embedding in the scheduler support. Focusing on a single supported > mode > >>> of > >>>> operation will allow us to significantly reduce the size (and > complexity) > >>>> of the DAG processing code. > >>>> > >>>> On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun < > pierrejb...@gmail.com> > >>>> wrote: > >>>> > >>>>> +1 > >>>>> > >>>>> On Fri, Jan 10, 2025 at 11:01 AM Michał Modras > >>>>> <michalmod...@google.com.invalid> wrote: > >>>>> > >>>>>> +1 - separating these workloads makes sense to me - we remove > >>>>>> unnecessary coupling and make them more single-responsibility, which > >>>>> eases > >>>>>> reasoning about the system and any potential debugging > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> > >>>> wrote: > >>>>>>> Yeah, purely from operational perspective, debugging issues become > >>>> lots > >>>>>>> simpler if they are separated as one is CPU hungry while the other > >>> is > >>>>>>> memory hungry! > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> > >>> wrote: > >>>>>>>> +1 (or rather +10). There are two additional things - both > >>> related > >>>> to > >>>>>>>> security and our new security model. > >>>>>>>> > >>>>>>>> 1) separate standalone dag processor follows "secure by design" > >>>>>>> principle. > >>>>>>>> Having scheduler and dag file processor sharing the same > >>> "process" > >>>>>> space > >>>>>>> is > >>>>>>>> a problem with isolation of DAG Author controlled security > >>>> perimeter > >>>>>> and > >>>>>>>> scheduler perimeter. While they were separate processes, it's > >>> just > >>>>>>>> inherently unsafe (from the security model perspective) to have > >>> the > >>>>> DAG > >>>>>>>> processor started as a sub-process. And this is not an "academic" > >>>>> case > >>>>>> - > >>>>>>> we > >>>>>>>> had a few security issues reported to us that could be only > >>>> exploited > >>>>>> if > >>>>>>>> airflow was run in the default mode, the issues were not > >>>> exploitable > >>>>>> when > >>>>>>>> standalone DAG Processor was used. And that is independent from > >>>> point > >>>>>> 2) > >>>>>>>> regarding the database access - just running in the same process > >>>>> space > >>>>>>>> allows DAG author to impact running scheduler code in various > >>> ways > >>>>> (via > >>>>>>>> temporary files for example - but there are multiple other > >>>> scenarios > >>>>>> and > >>>>>>>> attack vectors). > >>>>>>>> > >>>>>>>> 2) Currently the DAG processor has very different requirements > >>> than > >>>>>>>> scheduler when it comes to database access. Basically it MUST NOT > >>>>>> connect > >>>>>>>> to the Airflow meta-database. We already saw failures yesterday > >>>> after > >>>>>>>> merging bundle parsing that suggest that it was caused by > >>>> connection > >>>>>>> being > >>>>>>>> set in scheduler and DAG processor forked from it via > >>>>> multiprocessing - > >>>>>>>> originally we re-initialized database when we forked the > >>> processor, > >>>>> but > >>>>>>> now > >>>>>>>> DAG processor MUST NOT use the DB, so basically it looks like we > >>>> leak > >>>>>> the > >>>>>>>> DB to the processor. Which is also yet another security issue - > >>> if > >>>>>>>> scheduler has a way to initialize the database, it means it has > >>>>> access > >>>>>> to > >>>>>>>> the database credentials, and it also means that unless we > >>> involve > >>>>> some > >>>>>>>> kind of cgroups docker-like process separation, such forked DAG > >>>>>> processor > >>>>>>>> (and this also means DAG author) can access those credentials, > >>> and > >>>>>> access > >>>>>>>> database. This means pretty much that embedded DAG processor > >>> simply > >>>>>>> breaks > >>>>>>>> the "no DB access by DAG author" assumptions of Airflow 3. > >>>>>>>> > >>>>>>>> J. > >>>>>>>> > >>>>>>>> > >>>>>>>> On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> > >>>>>> wrote: > >>>>>>>>> +1 > >>>>>>>>> > >>>>>>>>> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham > >>>>>> <shu...@amazon.com.invalid > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> + 1 on this as well. From what I have seen, standalone DAG > >>>>>> processing > >>>>>>>>>> results in a minor performance advantage and, importantly, > >>>> makes > >>>>>> the > >>>>>>>>>> Scheduler loop more resilient to DAG processor crashes. > >>>>>>>>>> > >>>>>>>>>> Shubham > >>>>>>>>>> > >>>>>>>>>> On 2025-01-09, 4:02 PM, "Daniel Imberman" < > >>>>>>> daniel.imber...@gmail.com > >>>>>>>>>> <mailto:daniel.imber...@gmail.com>> wrote: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> CAUTION: This email originated from outside of the > >>>> organization. > >>>>> Do > >>>>>>> not > >>>>>>>>>> click links or open attachments unless you can confirm the > >>>> sender > >>>>>> and > >>>>>>>>> know > >>>>>>>>>> the content is safe. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> AVERTISSEMENT: Ce courrier électronique provient d’un > >>>> expéditeur > >>>>>>>> externe. > >>>>>>>>>> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si > >>>> vous > >>>>>> ne > >>>>>>>>> pouvez > >>>>>>>>>> pas confirmer l’identité de l’expéditeur et si vous n’êtes > >>> pas > >>>>>>> certain > >>>>>>>>> que > >>>>>>>>>> le contenu ne présente aucun risque. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I'm +1 on this. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> The fact that there's one more thing to deploy isn't that big > >>>> of > >>>>> an > >>>>>>>> issue > >>>>>>>>>> given the number of pre-configurable options mentioned (e.g. > >>>>> helm) > >>>>>>> and > >>>>>>>> a > >>>>>>>>>> full logical separation of DAG parsing and scheduling makes > >>>> sense > >>>>>>> (one > >>>>>>>>>> thing that has been a longstanding issue with Airflow is the > >>>>>>> scheduler > >>>>>>>>>> "Doing too many things", so it would be nice to create a > >>> clean > >>>>>> divide > >>>>>>>>>> here). > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham > >>>>>> <j...@astronomer.io.inva > >>>>>>>>>> <mailto:j...@astronomer.io.inva>lid> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> Hello everyone! > >>>>>>>>>>> > >>>>>>>>>>> As I've been working on parsing lately, I want to propose a > >>>>>> change > >>>>>>> in > >>>>>>>>>> that > >>>>>>>>>>> area in time for Airflow 3. > >>>>>>>>>>> > >>>>>>>>>>> Today there are 2 different ways the DAG processor can be > >>> run > >>>>> in > >>>>>>>>> Airflow > >>>>>>>>>> - > >>>>>>>>>>> as a standalone component, or embedded in the scheduler. > >>> The > >>>>>>>> standalone > >>>>>>>>>>> option came in 2.3, prior to that the only option was it > >>>> being > >>>>>>>> embedded > >>>>>>>>>> in > >>>>>>>>>>> the scheduler. > >>>>>>>>>>> > >>>>>>>>>>> Why standalone? Generally speaking, parsing scales > >>> vertically > >>>>>>> (single > >>>>>>>>>> loop > >>>>>>>>>>> - more concurrent parsing) while scheduling is scaled > >>>>>> horizontally > >>>>>>>>> (many > >>>>>>>>>>> loops). As the DAG processor and scheduler scale in > >>> different > >>>>>>>> manners, > >>>>>>>>>> it's > >>>>>>>>>>> awkward to have them live in the same component. There is > >>>> also > >>>>> a > >>>>>>>>>> resiliency > >>>>>>>>>>> aspect here, no noisy neighbor issues. > >>>>>>>>>>> > >>>>>>>>>>> Really, the only positive of the embedded option is that > >>> it's > >>>>>>> easier > >>>>>>>> to > >>>>>>>>>>> deploy, as there is 1 less component to worry about. > >>> However, > >>>>> we > >>>>>>>>> already > >>>>>>>>>>> have a number of components, so 1 more isn't that > >>> cumbersome. > >>>>>>>> Everyone > >>>>>>>>>>> using breeze, standalone, the helm chart, a vendor, won't > >>> be > >>>>>>> impacted > >>>>>>>>>> much > >>>>>>>>>>> by this change - in fact, having the log stream separate > >>> is a > >>>>> big > >>>>>>>>>> positive! > >>>>>>>>>>> We'd also be able to remove a bit of complexity around > >>>>>>>> reinitialising a > >>>>>>>>>>> bunch of stuff in the child process. > >>>>>>>>>>> > >>>>>>>>>>> Overall, I see primarily positives with this change, and a > >>>>> major > >>>>>>>>> version > >>>>>>>>>>> upgrade is the perfect time to simplify this part of > >>> Airflow. > >>>>>>>> Thoughts? > >>>>>>>>>>> Jed > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > > For additional commands, e-mail: dev-h...@airflow.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > For additional commands, e-mail: dev-h...@airflow.apache.org > >