+1 on this For many reasons, which have already been brought up in the thread.
On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov <ikholo...@google.com.invalid> wrote: > +1, there are a lot of old code paths that exist only because of the > embedding in the scheduler support. Focusing on a single supported mode of > operation will allow us to significantly reduce the size (and complexity) > of the DAG processing code. > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <pierrejb...@gmail.com> > wrote: > > > +1 > > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras > > <michalmod...@google.com.invalid> wrote: > > > > > +1 - separating these workloads makes sense to me - we remove > > > unnecessary coupling and make them more single-responsibility, which > > eases > > > reasoning about the system and any potential debugging > > > > > > > > > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> > wrote: > > > > > > > Yeah, purely from operational perspective, debugging issues become > lots > > > > simpler if they are separated as one is CPU hungry while the other is > > > > memory hungry! > > > > > > > > > > > > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> wrote: > > > > > > > > > +1 (or rather +10). There are two additional things - both related > to > > > > > security and our new security model. > > > > > > > > > > 1) separate standalone dag processor follows "secure by design" > > > > principle. > > > > > Having scheduler and dag file processor sharing the same "process" > > > space > > > > is > > > > > a problem with isolation of DAG Author controlled security > perimeter > > > and > > > > > scheduler perimeter. While they were separate processes, it's just > > > > > inherently unsafe (from the security model perspective) to have the > > DAG > > > > > processor started as a sub-process. And this is not an "academic" > > case > > > - > > > > we > > > > > had a few security issues reported to us that could be only > exploited > > > if > > > > > airflow was run in the default mode, the issues were not > exploitable > > > when > > > > > standalone DAG Processor was used. And that is independent from > point > > > 2) > > > > > regarding the database access - just running in the same process > > space > > > > > allows DAG author to impact running scheduler code in various ways > > (via > > > > > temporary files for example - but there are multiple other > scenarios > > > and > > > > > attack vectors). > > > > > > > > > > 2) Currently the DAG processor has very different requirements than > > > > > scheduler when it comes to database access. Basically it MUST NOT > > > connect > > > > > to the Airflow meta-database. We already saw failures yesterday > after > > > > > merging bundle parsing that suggest that it was caused by > connection > > > > being > > > > > set in scheduler and DAG processor forked from it via > > multiprocessing - > > > > > originally we re-initialized database when we forked the processor, > > but > > > > now > > > > > DAG processor MUST NOT use the DB, so basically it looks like we > leak > > > the > > > > > DB to the processor. Which is also yet another security issue - if > > > > > scheduler has a way to initialize the database, it means it has > > access > > > to > > > > > the database credentials, and it also means that unless we involve > > some > > > > > kind of cgroups docker-like process separation, such forked DAG > > > processor > > > > > (and this also means DAG author) can access those credentials, and > > > access > > > > > database. This means pretty much that embedded DAG processor simply > > > > breaks > > > > > the "no DB access by DAG author" assumptions of Airflow 3. > > > > > > > > > > J. > > > > > > > > > > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> > > > wrote: > > > > > > > > > > > +1 > > > > > > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham > > > <shu...@amazon.com.invalid > > > > > > > > > > > wrote: > > > > > > > > > > > > > + 1 on this as well. From what I have seen, standalone DAG > > > processing > > > > > > > results in a minor performance advantage and, importantly, > makes > > > the > > > > > > > Scheduler loop more resilient to DAG processor crashes. > > > > > > > > > > > > > > Shubham > > > > > > > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" < > > > > daniel.imber...@gmail.com > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote: > > > > > > > > > > > > > > > > > > > > > CAUTION: This email originated from outside of the > organization. > > Do > > > > not > > > > > > > click links or open attachments unless you can confirm the > sender > > > and > > > > > > know > > > > > > > the content is safe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un > expéditeur > > > > > externe. > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si > vous > > > ne > > > > > > pouvez > > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas > > > > certain > > > > > > que > > > > > > > le contenu ne présente aucun risque. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm +1 on this. > > > > > > > > > > > > > > > > > > > > > The fact that there's one more thing to deploy isn't that big > of > > an > > > > > issue > > > > > > > given the number of pre-configurable options mentioned (e.g. > > helm) > > > > and > > > > > a > > > > > > > full logical separation of DAG parsing and scheduling makes > sense > > > > (one > > > > > > > thing that has been a longstanding issue with Airflow is the > > > > scheduler > > > > > > > "Doing too many things", so it would be nice to create a clean > > > divide > > > > > > > here). > > > > > > > > > > > > > > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham > > > <j...@astronomer.io.inva > > > > > > > <mailto:j...@astronomer.io.inva>lid> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > Hello everyone! > > > > > > > > > > > > > > > > As I've been working on parsing lately, I want to propose a > > > change > > > > in > > > > > > > that > > > > > > > > area in time for Airflow 3. > > > > > > > > > > > > > > > > Today there are 2 different ways the DAG processor can be run > > in > > > > > > Airflow > > > > > > > - > > > > > > > > as a standalone component, or embedded in the scheduler. The > > > > > standalone > > > > > > > > option came in 2.3, prior to that the only option was it > being > > > > > embedded > > > > > > > in > > > > > > > > the scheduler. > > > > > > > > > > > > > > > > Why standalone? Generally speaking, parsing scales vertically > > > > (single > > > > > > > loop > > > > > > > > - more concurrent parsing) while scheduling is scaled > > > horizontally > > > > > > (many > > > > > > > > loops). As the DAG processor and scheduler scale in different > > > > > manners, > > > > > > > it's > > > > > > > > awkward to have them live in the same component. There is > also > > a > > > > > > > resiliency > > > > > > > > aspect here, no noisy neighbor issues. > > > > > > > > > > > > > > > > Really, the only positive of the embedded option is that it's > > > > easier > > > > > to > > > > > > > > deploy, as there is 1 less component to worry about. However, > > we > > > > > > already > > > > > > > > have a number of components, so 1 more isn't that cumbersome. > > > > > Everyone > > > > > > > > using breeze, standalone, the helm chart, a vendor, won't be > > > > impacted > > > > > > > much > > > > > > > > by this change - in fact, having the log stream separate is a > > big > > > > > > > positive! > > > > > > > > > > > > > > > > We'd also be able to remove a bit of complexity around > > > > > reinitialising a > > > > > > > > bunch of stuff in the child process. > > > > > > > > > > > > > > > > Overall, I see primarily positives with this change, and a > > major > > > > > > version > > > > > > > > upgrade is the perfect time to simplify this part of Airflow. > > > > > Thoughts? > > > > > > > > > > > > > > > > Jed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >