+1 (or rather +10). There are two additional things - both related to security and our new security model.
1) A separate, standalone DAG processor follows the "secure by design" principle. Having the scheduler and the DAG file processor share the same "process" space is a problem for isolating the DAG-author-controlled security perimeter from the scheduler perimeter. Even though they are separate processes, it is inherently unsafe (from the security model perspective) to have the DAG processor started as a sub-process of the scheduler. And this is not an "academic" case - we had a few security issues reported to us that could only be exploited when Airflow was run in the default mode; they were not exploitable when the standalone DAG processor was used. This is independent of point 2) regarding database access - just running in the same process space allows a DAG author to impact the running scheduler code in various ways (via temporary files, for example - but there are multiple other scenarios and attack vectors).

2) Currently the DAG processor has very different requirements than the scheduler when it comes to database access. Basically, it MUST NOT connect to the Airflow meta-database. We already saw failures yesterday, after merging bundle parsing, that appear to be caused by the connection being set up in the scheduler and the DAG processor being forked from it via multiprocessing. Originally we re-initialized the database when we forked the processor, but now the DAG processor MUST NOT use the DB, so it looks like we leak the DB connection to the processor. This is yet another security issue: if the scheduler has a way to initialize the database, it has access to the database credentials, and unless we involve some kind of cgroups/docker-like process separation, the forked DAG processor (and this also means the DAG author) can access those credentials and access the database. This pretty much means that the embedded DAG processor simply breaks the "no DB access by DAG author" assumption of Airflow 3.

J.

On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> +1
>
> On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham <shu...@amazon.com.invalid>
> wrote:
>
> > + 1 on this as well. From what I have seen, standalone DAG processing
> > results in a minor performance advantage and, importantly, makes the
> > Scheduler loop more resilient to DAG processor crashes.
> >
> > Shubham
> >
> > On 2025-01-09, 4:02 PM, "Daniel Imberman" <daniel.imber...@gmail.com>
> > wrote:
> >
> > I'm +1 on this.
> >
> > The fact that there's one more thing to deploy isn't that big of an issue
> > given the number of pre-configurable options mentioned (e.g. helm) and a
> > full logical separation of DAG parsing and scheduling makes sense (one
> > thing that has been a longstanding issue with Airflow is the scheduler
> > "Doing too many things", so it would be nice to create a clean divide
> > here).
> >
> > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham <j...@astronomer.io.invalid>
> > wrote:
> >
> > > Hello everyone!
> > >
> > > As I've been working on parsing lately, I want to propose a change in
> > > that area in time for Airflow 3.
> > >
> > > Today there are 2 different ways the DAG processor can be run in
> > > Airflow - as a standalone component, or embedded in the scheduler. The
> > > standalone option came in 2.3, prior to that the only option was it
> > > being embedded in the scheduler.
> > >
> > > Why standalone? Generally speaking, parsing scales vertically (single
> > > loop - more concurrent parsing) while scheduling is scaled horizontally
> > > (many loops). As the DAG processor and scheduler scale in different
> > > manners, it's awkward to have them live in the same component. There is
> > > also a resiliency aspect here, no noisy neighbor issues.
> > >
> > > Really, the only positive of the embedded option is that it's easier to
> > > deploy, as there is 1 less component to worry about. However, we
> > > already have a number of components, so 1 more isn't that cumbersome.
> > > Everyone using breeze, standalone, the helm chart, a vendor, won't be
> > > impacted much by this change - in fact, having the log stream separate
> > > is a big positive!
> > >
> > > We'd also be able to remove a bit of complexity around reinitialising a
> > > bunch of stuff in the child process.
> > >
> > > Overall, I see primarily positives with this change, and a major
> > > version upgrade is the perfect time to simplify this part of Airflow.
> > > Thoughts?
> > >
> > > Jed