+1 - separating these workloads makes sense to me - we remove unnecessary coupling and make them more single-responsibility, which eases reasoning about the system and any potential debugging
On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote: > Yeah, purely from operational perspective, debugging issues become lots > simpler if they are separated as one is CPU hungry while the other is > memory hungry! > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> wrote: > > > +1 (or rather +10). There are two additional things - both related to > > security and our new security model. > > > > 1) separate standalone dag processor follows "secure by design" > principle. > > Having scheduler and dag file processor sharing the same "process" space > is > > a problem with isolation of DAG Author controlled security perimeter and > > scheduler perimeter. While they were separate processes, it's just > > inherently unsafe (from the security model perspective) to have the DAG > > processor started as a sub-process. And this is not an "academic" case - > we > > had a few security issues reported to us that could be only exploited if > > airflow was run in the default mode, the issues were not exploitable when > > standalone DAG Processor was used. And that is independent from point 2) > > regarding the database access - just running in the same process space > > allows DAG author to impact running scheduler code in various ways (via > > temporary files for example - but there are multiple other scenarios and > > attack vectors). > > > > 2) Currently the DAG processor has very different requirements than > > scheduler when it comes to database access. Basically it MUST NOT connect > > to the Airflow meta-database. We already saw failures yesterday after > > merging bundle parsing that suggest that it was caused by connection > being > > set in scheduler and DAG processor forked from it via multiprocessing - > > originally we re-initialized database when we forked the processor, but > now > > DAG processor MUST NOT use the DB, so basically it looks like we leak the > > DB to the processor. Which is also yet another security issue - if > > scheduler has a way to initialize the database, it means it has access to > > the database credentials, and it also means that unless we involve some > > kind of cgroups docker-like process separation, such forked DAG processor > > (and this also means DAG author) can access those credentials, and access > > database. This means pretty much that embedded DAG processor simply > breaks > > the "no DB access by DAG author" assumptions of Airflow 3. > > > > J. > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com> wrote: > > > > > +1 > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham <shu...@amazon.com.invalid > > > > > wrote: > > > > > > > + 1 on this as well. From what I have seen, standalone DAG processing > > > > results in a minor performance advantage and, importantly, makes the > > > > Scheduler loop more resilient to DAG processor crashes. > > > > > > > > Shubham > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" < > daniel.imber...@gmail.com > > > > <mailto:daniel.imber...@gmail.com>> wrote: > > > > > > > > > > > > CAUTION: This email originated from outside of the organization. Do > not > > > > click links or open attachments unless you can confirm the sender and > > > know > > > > the content is safe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur > > externe. > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne > > > pouvez > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas > certain > > > que > > > > le contenu ne présente aucun risque. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm +1 on this. > > > > > > > > > > > > The fact that there's one more thing to deploy isn't that big of an > > issue > > > > given the number of pre-configurable options mentioned (e.g. helm) > and > > a > > > > full logical separation of DAG parsing and scheduling makes sense > (one > > > > thing that has been a longstanding issue with Airflow is the > scheduler > > > > "Doing too many things", so it would be nice to create a clean divide > > > > here). > > > > > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham <j...@astronomer.io.inva > > > > <mailto:j...@astronomer.io.inva>lid> > > > > wrote: > > > > > > > > > > > > > Hello everyone! > > > > > > > > > > As I've been working on parsing lately, I want to propose a change > in > > > > that > > > > > area in time for Airflow 3. > > > > > > > > > > Today there are 2 different ways the DAG processor can be run in > > > Airflow > > > > - > > > > > as a standalone component, or embedded in the scheduler. The > > standalone > > > > > option came in 2.3, prior to that the only option was it being > > embedded > > > > in > > > > > the scheduler. > > > > > > > > > > Why standalone? Generally speaking, parsing scales vertically > (single > > > > loop > > > > > - more concurrent parsing) while scheduling is scaled horizontally > > > (many > > > > > loops). As the DAG processor and scheduler scale in different > > manners, > > > > it's > > > > > awkward to have them live in the same component. There is also a > > > > resiliency > > > > > aspect here, no noisy neighbor issues. > > > > > > > > > > Really, the only positive of the embedded option is that it's > easier > > to > > > > > deploy, as there is 1 less component to worry about. However, we > > > already > > > > > have a number of components, so 1 more isn't that cumbersome. > > Everyone > > > > > using breeze, standalone, the helm chart, a vendor, won't be > impacted > > > > much > > > > > by this change - in fact, having the log stream separate is a big > > > > positive! > > > > > > > > > > We'd also be able to remove a bit of complexity around > > reinitialising a > > > > > bunch of stuff in the child process. > > > > > > > > > > Overall, I see primarily positives with this change, and a major > > > version > > > > > upgrade is the perfect time to simplify this part of Airflow. > > Thoughts? > > > > > > > > > > Jed > > > > > > > > > > > > > > > > > > > > > > > > > > >