Thanks everyone, lazy consensus here: https://lists.apache.org/thread/cx1mg5g1pw4k1km56hzr06d0o39kdk8d
On Fri, Jan 10, 2025 at 2:08 PM Jed Cunningham <j...@astronomer.io> wrote: > Small clarification: the DAG processor manager still has access to the db > - but the DagFileProcessorProcess (eventually) doesn't, and uses the SDK > machinery to communicate with the DagFileProcessorManager. The manager acts > as its own "api server" in a way. > > Either way though, in the SQLite context, it's really no different running > in the scheduler or alongside it. Any performance problems that would exist > should already exist. And, I also don't think we should let performance > issues for SQLite have too big of a say in what we do (unless it's > unusable) - it is for development only after all. > > On Fri, Jan 10, 2025 at 1:41 PM Jens Scheffler <j_scheff...@gmx.de.invalid> > wrote: > >> +1 - as a very common problem in #user-troubleshooting is that DAGs are >> vanishing because of poor DB performance or (mostly) poorly written >> DAGs. If Performance problems appear in DAG parsing, separating the >> parser is the first hand fix. So we should have this per default. >> (Besides the case I hope at some point we come with the new DAG Bundle >> to a point where performance problems in DAG parsing do not make DAGs >> dis-appearing just because the re-parsing cycle is taking too long) >> >> On 10.01.25 17:54, Vincent Beck wrote: >> > +1 >> > >> > On 2025/01/10 15:50:55 Igor Kholopov wrote: >> >> The only thing that we need to clear out before this is sealed - what >> are >> >> we going to do with SQLite? SQLite supports concurrent connections if >> the >> >> processes are on the same host and we already have WAL enabled by Ash >> in >> >> https://github.com/apache/airflow/pull/44839/files. But I think we >> still >> >> need to do some additional smoke-testing to confirm that SQLite >> performance >> >> doesn't degrade drastically with dag-processor and scheduler writing >> to it >> >> concurrently. >> >> >> >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka >> <vik...@astronomer.io.invalid> >> >> wrote: >> >> >> >>> +1 on this >> >>> >> >>> For many reasons, which have already been brought up in the thread. >> >>> >> >>> >> >>> On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov >> <ikholo...@google.com.invalid >> >>> wrote: >> >>> >> >>>> +1, there are a lot of old code paths that exist only because of the >> >>>> embedding in the scheduler support. Focusing on a single supported >> mode >> >>> of >> >>>> operation will allow us to significantly reduce the size (and >> complexity) >> >>>> of the DAG processing code. >> >>>> >> >>>> On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun < >> pierrejb...@gmail.com> >> >>>> wrote: >> >>>> >> >>>>> +1 >> >>>>> >> >>>>> On Fri, Jan 10, 2025 at 11:01 AM Michał Modras >> >>>>> <michalmod...@google.com.invalid> wrote: >> >>>>> >> >>>>>> +1 - separating these workloads makes sense to me - we remove >> >>>>>> unnecessary coupling and make them more single-responsibility, >> which >> >>>>> eases >> >>>>>> reasoning about the system and any potential debugging >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> >> >>>> wrote: >> >>>>>>> Yeah, purely from operational perspective, debugging issues become >> >>>> lots >> >>>>>>> simpler if they are separated as one is CPU hungry while the other >> >>> is >> >>>>>>> memory hungry! >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> >> >>> wrote: >> >>>>>>>> +1 (or rather +10). There are two additional things - both >> >>> related >> >>>> to >> >>>>>>>> security and our new security model. >> >>>>>>>> >> >>>>>>>> 1) separate standalone dag processor follows "secure by design" >> >>>>>>> principle. >> >>>>>>>> Having scheduler and dag file processor sharing the same >> >>> "process" >> >>>>>> space >> >>>>>>> is >> >>>>>>>> a problem with isolation of DAG Author controlled security >> >>>> perimeter >> >>>>>> and >> >>>>>>>> scheduler perimeter. While they were separate processes, it's >> >>> just >> >>>>>>>> inherently unsafe (from the security model perspective) to have >> >>> the >> >>>>> DAG >> >>>>>>>> processor started as a sub-process. And this is not an "academic" >> >>>>> case >> >>>>>> - >> >>>>>>> we >> >>>>>>>> had a few security issues reported to us that could be only >> >>>> exploited >> >>>>>> if >> >>>>>>>> airflow was run in the default mode, the issues were not >> >>>> exploitable >> >>>>>> when >> >>>>>>>> standalone DAG Processor was used. And that is independent from >> >>>> point >> >>>>>> 2) >> >>>>>>>> regarding the database access - just running in the same process >> >>>>> space >> >>>>>>>> allows DAG author to impact running scheduler code in various >> >>> ways >> >>>>> (via >> >>>>>>>> temporary files for example - but there are multiple other >> >>>> scenarios >> >>>>>> and >> >>>>>>>> attack vectors). >> >>>>>>>> >> >>>>>>>> 2) Currently the DAG processor has very different requirements >> >>> than >> >>>>>>>> scheduler when it comes to database access. Basically it MUST NOT >> >>>>>> connect >> >>>>>>>> to the Airflow meta-database. We already saw failures yesterday >> >>>> after >> >>>>>>>> merging bundle parsing that suggest that it was caused by >> >>>> connection >> >>>>>>> being >> >>>>>>>> set in scheduler and DAG processor forked from it via >> >>>>> multiprocessing - >> >>>>>>>> originally we re-initialized database when we forked the >> >>> processor, >> >>>>> but >> >>>>>>> now >> >>>>>>>> DAG processor MUST NOT use the DB, so basically it looks like we >> >>>> leak >> >>>>>> the >> >>>>>>>> DB to the processor. Which is also yet another security issue - >> >>> if >> >>>>>>>> scheduler has a way to initialize the database, it means it has >> >>>>> access >> >>>>>> to >> >>>>>>>> the database credentials, and it also means that unless we >> >>> involve >> >>>>> some >> >>>>>>>> kind of cgroups docker-like process separation, such forked DAG >> >>>>>> processor >> >>>>>>>> (and this also means DAG author) can access those credentials, >> >>> and >> >>>>>> access >> >>>>>>>> database. This means pretty much that embedded DAG processor >> >>> simply >> >>>>>>> breaks >> >>>>>>>> the "no DB access by DAG author" assumptions of Airflow 3. >> >>>>>>>> >> >>>>>>>> J. >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> >> >>>>>> wrote: >> >>>>>>>>> +1 >> >>>>>>>>> >> >>>>>>>>> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham >> >>>>>> <shu...@amazon.com.invalid >> >>>>>>>>> wrote: >> >>>>>>>>> >> >>>>>>>>>> + 1 on this as well. From what I have seen, standalone DAG >> >>>>>> processing >> >>>>>>>>>> results in a minor performance advantage and, importantly, >> >>>> makes >> >>>>>> the >> >>>>>>>>>> Scheduler loop more resilient to DAG processor crashes. >> >>>>>>>>>> >> >>>>>>>>>> Shubham >> >>>>>>>>>> >> >>>>>>>>>> On 2025-01-09, 4:02 PM, "Daniel Imberman" < >> >>>>>>> daniel.imber...@gmail.com >> >>>>>>>>>> <mailto:daniel.imber...@gmail.com>> wrote: >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> CAUTION: This email originated from outside of the >> >>>> organization. >> >>>>> Do >> >>>>>>> not >> >>>>>>>>>> click links or open attachments unless you can confirm the >> >>>> sender >> >>>>>> and >> >>>>>>>>> know >> >>>>>>>>>> the content is safe. >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> AVERTISSEMENT: Ce courrier électronique provient d’un >> >>>> expéditeur >> >>>>>>>> externe. >> >>>>>>>>>> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si >> >>>> vous >> >>>>>> ne >> >>>>>>>>> pouvez >> >>>>>>>>>> pas confirmer l’identité de l’expéditeur et si vous n’êtes >> >>> pas >> >>>>>>> certain >> >>>>>>>>> que >> >>>>>>>>>> le contenu ne présente aucun risque. >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> I'm +1 on this. >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> The fact that there's one more thing to deploy isn't that big >> >>>> of >> >>>>> an >> >>>>>>>> issue >> >>>>>>>>>> given the number of pre-configurable options mentioned (e.g. >> >>>>> helm) >> >>>>>>> and >> >>>>>>>> a >> >>>>>>>>>> full logical separation of DAG parsing and scheduling makes >> >>>> sense >> >>>>>>> (one >> >>>>>>>>>> thing that has been a longstanding issue with Airflow is the >> >>>>>>> scheduler >> >>>>>>>>>> "Doing too many things", so it would be nice to create a >> >>> clean >> >>>>>> divide >> >>>>>>>>>> here). >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham >> >>>>>> <j...@astronomer.io.inva >> >>>>>>>>>> <mailto:j...@astronomer.io.inva>lid> >> >>>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> Hello everyone! >> >>>>>>>>>>> >> >>>>>>>>>>> As I've been working on parsing lately, I want to propose a >> >>>>>> change >> >>>>>>> in >> >>>>>>>>>> that >> >>>>>>>>>>> area in time for Airflow 3. >> >>>>>>>>>>> >> >>>>>>>>>>> Today there are 2 different ways the DAG processor can be >> >>> run >> >>>>> in >> >>>>>>>>> Airflow >> >>>>>>>>>> - >> >>>>>>>>>>> as a standalone component, or embedded in the scheduler. >> >>> The >> >>>>>>>> standalone >> >>>>>>>>>>> option came in 2.3, prior to that the only option was it >> >>>> being >> >>>>>>>> embedded >> >>>>>>>>>> in >> >>>>>>>>>>> the scheduler. >> >>>>>>>>>>> >> >>>>>>>>>>> Why standalone? Generally speaking, parsing scales >> >>> vertically >> >>>>>>> (single >> >>>>>>>>>> loop >> >>>>>>>>>>> - more concurrent parsing) while scheduling is scaled >> >>>>>> horizontally >> >>>>>>>>> (many >> >>>>>>>>>>> loops). As the DAG processor and scheduler scale in >> >>> different >> >>>>>>>> manners, >> >>>>>>>>>> it's >> >>>>>>>>>>> awkward to have them live in the same component. There is >> >>>> also >> >>>>> a >> >>>>>>>>>> resiliency >> >>>>>>>>>>> aspect here, no noisy neighbor issues. >> >>>>>>>>>>> >> >>>>>>>>>>> Really, the only positive of the embedded option is that >> >>> it's >> >>>>>>> easier >> >>>>>>>> to >> >>>>>>>>>>> deploy, as there is 1 less component to worry about. >> >>> However, >> >>>>> we >> >>>>>>>>> already >> >>>>>>>>>>> have a number of components, so 1 more isn't that >> >>> cumbersome. >> >>>>>>>> Everyone >> >>>>>>>>>>> using breeze, standalone, the helm chart, a vendor, won't >> >>> be >> >>>>>>> impacted >> >>>>>>>>>> much >> >>>>>>>>>>> by this change - in fact, having the log stream separate >> >>> is a >> >>>>> big >> >>>>>>>>>> positive! >> >>>>>>>>>>> We'd also be able to remove a bit of complexity around >> >>>>>>>> reinitialising a >> >>>>>>>>>>> bunch of stuff in the child process. >> >>>>>>>>>>> >> >>>>>>>>>>> Overall, I see primarily positives with this change, and a >> >>>>> major >> >>>>>>>>> version >> >>>>>>>>>>> upgrade is the perfect time to simplify this part of >> >>> Airflow. >> >>>>>>>> Thoughts? >> >>>>>>>>>>> Jed >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org >> > For additional commands, e-mail: dev-h...@airflow.apache.org >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org >> For additional commands, e-mail: dev-h...@airflow.apache.org >> >>