Dag processor does not access the DB at all - it communicates with Task SDK to write the serialized DAGS.
On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov <ikholo...@google.com.invalid> wrote: > The only thing that we need to clear out before this is sealed - what are > we going to do with SQLite? SQLite supports concurrent connections if the > processes are on the same host and we already have WAL enabled by Ash in > https://github.com/apache/airflow/pull/44839/files. But I think we still > need to do some additional smoke-testing to confirm that SQLite performance > doesn't degrade drastically with dag-processor and scheduler writing to it > concurrently. > > On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <vik...@astronomer.io.invalid> > wrote: > > > +1 on this > > > > For many reasons, which have already been brought up in the thread. > > > > > > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov > <ikholo...@google.com.invalid > > > > > wrote: > > > > > +1, there are a lot of old code paths that exist only because of the > > > embedding in the scheduler support. Focusing on a single supported mode > > of > > > operation will allow us to significantly reduce the size (and > complexity) > > > of the DAG processing code. > > > > > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun < > pierrejb...@gmail.com> > > > wrote: > > > > > > > +1 > > > > > > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras > > > > <michalmod...@google.com.invalid> wrote: > > > > > > > > > +1 - separating these workloads makes sense to me - we remove > > > > > unnecessary coupling and make them more single-responsibility, > which > > > > eases > > > > > reasoning about the system and any potential debugging > > > > > > > > > > > > > > > > > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> > > > wrote: > > > > > > > > > > > Yeah, purely from operational perspective, debugging issues > become > > > lots > > > > > > simpler if they are separated as one is CPU hungry while the > other > > is > > > > > > memory hungry! > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> > > wrote: > > > > > > > > > > > > > +1 (or rather +10). There are two additional things - both > > related > > > to > > > > > > > security and our new security model. > > > > > > > > > > > > > > 1) separate standalone dag processor follows "secure by design" > > > > > > principle. > > > > > > > Having scheduler and dag file processor sharing the same > > "process" > > > > > space > > > > > > is > > > > > > > a problem with isolation of DAG Author controlled security > > > perimeter > > > > > and > > > > > > > scheduler perimeter. While they were separate processes, it's > > just > > > > > > > inherently unsafe (from the security model perspective) to have > > the > > > > DAG > > > > > > > processor started as a sub-process. And this is not an > "academic" > > > > case > > > > > - > > > > > > we > > > > > > > had a few security issues reported to us that could be only > > > exploited > > > > > if > > > > > > > airflow was run in the default mode, the issues were not > > > exploitable > > > > > when > > > > > > > standalone DAG Processor was used. And that is independent from > > > point > > > > > 2) > > > > > > > regarding the database access - just running in the same > process > > > > space > > > > > > > allows DAG author to impact running scheduler code in various > > ways > > > > (via > > > > > > > temporary files for example - but there are multiple other > > > scenarios > > > > > and > > > > > > > attack vectors). > > > > > > > > > > > > > > 2) Currently the DAG processor has very different requirements > > than > > > > > > > scheduler when it comes to database access. Basically it MUST > NOT > > > > > connect > > > > > > > to the Airflow meta-database. We already saw failures yesterday > > > after > > > > > > > merging bundle parsing that suggest that it was caused by > > > connection > > > > > > being > > > > > > > set in scheduler and DAG processor forked from it via > > > > multiprocessing - > > > > > > > originally we re-initialized database when we forked the > > processor, > > > > but > > > > > > now > > > > > > > DAG processor MUST NOT use the DB, so basically it looks like > we > > > leak > > > > > the > > > > > > > DB to the processor. Which is also yet another security issue - > > if > > > > > > > scheduler has a way to initialize the database, it means it has > > > > access > > > > > to > > > > > > > the database credentials, and it also means that unless we > > involve > > > > some > > > > > > > kind of cgroups docker-like process separation, such forked DAG > > > > > processor > > > > > > > (and this also means DAG author) can access those credentials, > > and > > > > > access > > > > > > > database. This means pretty much that embedded DAG processor > > simply > > > > > > breaks > > > > > > > the "no DB access by DAG author" assumptions of Airflow 3. > > > > > > > > > > > > > > J. > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik < > kaxiln...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > +1 > > > > > > > > > > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham > > > > > <shu...@amazon.com.invalid > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > + 1 on this as well. From what I have seen, standalone DAG > > > > > processing > > > > > > > > > results in a minor performance advantage and, importantly, > > > makes > > > > > the > > > > > > > > > Scheduler loop more resilient to DAG processor crashes. > > > > > > > > > > > > > > > > > > Shubham > > > > > > > > > > > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" < > > > > > > daniel.imber...@gmail.com > > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > CAUTION: This email originated from outside of the > > > organization. > > > > Do > > > > > > not > > > > > > > > > click links or open attachments unless you can confirm the > > > sender > > > > > and > > > > > > > > know > > > > > > > > > the content is safe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un > > > expéditeur > > > > > > > externe. > > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe > si > > > vous > > > > > ne > > > > > > > > pouvez > > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes > > pas > > > > > > certain > > > > > > > > que > > > > > > > > > le contenu ne présente aucun risque. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm +1 on this. > > > > > > > > > > > > > > > > > > > > > > > > > > > The fact that there's one more thing to deploy isn't that > big > > > of > > > > an > > > > > > > issue > > > > > > > > > given the number of pre-configurable options mentioned > (e.g. > > > > helm) > > > > > > and > > > > > > > a > > > > > > > > > full logical separation of DAG parsing and scheduling makes > > > sense > > > > > > (one > > > > > > > > > thing that has been a longstanding issue with Airflow is > the > > > > > > scheduler > > > > > > > > > "Doing too many things", so it would be nice to create a > > clean > > > > > divide > > > > > > > > > here). > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham > > > > > <j...@astronomer.io.inva > > > > > > > > > <mailto:j...@astronomer.io.inva>lid> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello everyone! > > > > > > > > > > > > > > > > > > > > As I've been working on parsing lately, I want to > propose a > > > > > change > > > > > > in > > > > > > > > > that > > > > > > > > > > area in time for Airflow 3. > > > > > > > > > > > > > > > > > > > > Today there are 2 different ways the DAG processor can be > > run > > > > in > > > > > > > > Airflow > > > > > > > > > - > > > > > > > > > > as a standalone component, or embedded in the scheduler. > > The > > > > > > > standalone > > > > > > > > > > option came in 2.3, prior to that the only option was it > > > being > > > > > > > embedded > > > > > > > > > in > > > > > > > > > > the scheduler. > > > > > > > > > > > > > > > > > > > > Why standalone? Generally speaking, parsing scales > > vertically > > > > > > (single > > > > > > > > > loop > > > > > > > > > > - more concurrent parsing) while scheduling is scaled > > > > > horizontally > > > > > > > > (many > > > > > > > > > > loops). As the DAG processor and scheduler scale in > > different > > > > > > > manners, > > > > > > > > > it's > > > > > > > > > > awkward to have them live in the same component. There is > > > also > > > > a > > > > > > > > > resiliency > > > > > > > > > > aspect here, no noisy neighbor issues. > > > > > > > > > > > > > > > > > > > > Really, the only positive of the embedded option is that > > it's > > > > > > easier > > > > > > > to > > > > > > > > > > deploy, as there is 1 less component to worry about. > > However, > > > > we > > > > > > > > already > > > > > > > > > > have a number of components, so 1 more isn't that > > cumbersome. > > > > > > > Everyone > > > > > > > > > > using breeze, standalone, the helm chart, a vendor, won't > > be > > > > > > impacted > > > > > > > > > much > > > > > > > > > > by this change - in fact, having the log stream separate > > is a > > > > big > > > > > > > > > positive! > > > > > > > > > > > > > > > > > > > > We'd also be able to remove a bit of complexity around > > > > > > > reinitialising a > > > > > > > > > > bunch of stuff in the child process. > > > > > > > > > > > > > > > > > > > > Overall, I see primarily positives with this change, and > a > > > > major > > > > > > > > version > > > > > > > > > > upgrade is the perfect time to simplify this part of > > Airflow. > > > > > > > Thoughts? > > > > > > > > > > > > > > > > > > > > Jed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >