+1 to this proposal -- Regards, Vishnu Chilukoori
On Tue, Jan 14, 2025 at 5:30 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote: > BIG +1 to this proposal! > > ________________________________ > From: Jarek Potiuk <ja...@potiuk.com> > Sent: Friday, January 10, 2025 8:54:50 AM > To: dev@airflow.apache.org > Subject: RE: [EXT] [DISCUSS] Drop support for the DAG processor embedded > in the scheduler > > CAUTION: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que > le contenu ne présente aucun risque. > > > > With Task API :) > > On Fri, Jan 10, 2025 at 5:54 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > > Dag processor does not access the DB at all - it communicates with Task > > SDK to write the serialized DAGS. > > > > On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov > <ikholo...@google.com.invalid> > > wrote: > > > >> The only thing that we need to clear out before this is sealed - what > are > >> we going to do with SQLite? SQLite supports concurrent connections if > the > >> processes are on the same host and we already have WAL enabled by Ash in > >> https://github.com/apache/airflow/pull/44839/files. But I think we > still > >> need to do some additional smoke-testing to confirm that SQLite > >> performance > >> doesn't degrade drastically with dag-processor and scheduler writing to > it > >> concurrently. > >> > >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka > <vik...@astronomer.io.invalid > >> > > >> wrote: > >> > >> > +1 on this > >> > > >> > For many reasons, which have already been brought up in the thread. > >> > > >> > > >> > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov > >> <ikholo...@google.com.invalid > >> > > > >> > wrote: > >> > > >> > > +1, there are a lot of old code paths that exist only because of the > >> > > embedding in the scheduler support. Focusing on a single supported > >> mode > >> > of > >> > > operation will allow us to significantly reduce the size (and > >> complexity) > >> > > of the DAG processing code. > >> > > > >> > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun < > >> pierrejb...@gmail.com> > >> > > wrote: > >> > > > >> > > > +1 > >> > > > > >> > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras > >> > > > <michalmod...@google.com.invalid> wrote: > >> > > > > >> > > > > +1 - separating these workloads makes sense to me - we remove > >> > > > > unnecessary coupling and make them more single-responsibility, > >> which > >> > > > eases > >> > > > > reasoning about the system and any potential debugging > >> > > > > > >> > > > > > >> > > > > > >> > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com > > > >> > > wrote: > >> > > > > > >> > > > > > Yeah, purely from operational perspective, debugging issues > >> become > >> > > lots > >> > > > > > simpler if they are separated as one is CPU hungry while the > >> other > >> > is > >> > > > > > memory hungry! > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> > >> > wrote: > >> > > > > > > >> > > > > > > +1 (or rather +10). There are two additional things - both > >> > related > >> > > to > >> > > > > > > security and our new security model. > >> > > > > > > > >> > > > > > > 1) separate standalone dag processor follows "secure by > >> design" > >> > > > > > principle. > >> > > > > > > Having scheduler and dag file processor sharing the same > >> > "process" > >> > > > > space > >> > > > > > is > >> > > > > > > a problem with isolation of DAG Author controlled security > >> > > perimeter > >> > > > > and > >> > > > > > > scheduler perimeter. While they were separate processes, > it's > >> > just > >> > > > > > > inherently unsafe (from the security model perspective) to > >> have > >> > the > >> > > > DAG > >> > > > > > > processor started as a sub-process. And this is not an > >> "academic" > >> > > > case > >> > > > > - > >> > > > > > we > >> > > > > > > had a few security issues reported to us that could be only > >> > > exploited > >> > > > > if > >> > > > > > > airflow was run in the default mode, the issues were not > >> > > exploitable > >> > > > > when > >> > > > > > > standalone DAG Processor was used. And that is independent > >> from > >> > > point > >> > > > > 2) > >> > > > > > > regarding the database access - just running in the same > >> process > >> > > > space > >> > > > > > > allows DAG author to impact running scheduler code in > various > >> > ways > >> > > > (via > >> > > > > > > temporary files for example - but there are multiple other > >> > > scenarios > >> > > > > and > >> > > > > > > attack vectors). > >> > > > > > > > >> > > > > > > 2) Currently the DAG processor has very different > requirements > >> > than > >> > > > > > > scheduler when it comes to database access. Basically it > MUST > >> NOT > >> > > > > connect > >> > > > > > > to the Airflow meta-database. We already saw failures > >> yesterday > >> > > after > >> > > > > > > merging bundle parsing that suggest that it was caused by > >> > > connection > >> > > > > > being > >> > > > > > > set in scheduler and DAG processor forked from it via > >> > > > multiprocessing - > >> > > > > > > originally we re-initialized database when we forked the > >> > processor, > >> > > > but > >> > > > > > now > >> > > > > > > DAG processor MUST NOT use the DB, so basically it looks > like > >> we > >> > > leak > >> > > > > the > >> > > > > > > DB to the processor. Which is also yet another security > issue > >> - > >> > if > >> > > > > > > scheduler has a way to initialize the database, it means it > >> has > >> > > > access > >> > > > > to > >> > > > > > > the database credentials, and it also means that unless we > >> > involve > >> > > > some > >> > > > > > > kind of cgroups docker-like process separation, such forked > >> DAG > >> > > > > processor > >> > > > > > > (and this also means DAG author) can access those > credentials, > >> > and > >> > > > > access > >> > > > > > > database. This means pretty much that embedded DAG processor > >> > simply > >> > > > > > breaks > >> > > > > > > the "no DB access by DAG author" assumptions of Airflow 3. > >> > > > > > > > >> > > > > > > J. > >> > > > > > > > >> > > > > > > > >> > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik < > >> kaxiln...@gmail.com> > >> > > > > wrote: > >> > > > > > > > >> > > > > > > > +1 > >> > > > > > > > > >> > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham > >> > > > > <shu...@amazon.com.invalid > >> > > > > > > > >> > > > > > > > wrote: > >> > > > > > > > > >> > > > > > > > > + 1 on this as well. From what I have seen, standalone > DAG > >> > > > > processing > >> > > > > > > > > results in a minor performance advantage and, > importantly, > >> > > makes > >> > > > > the > >> > > > > > > > > Scheduler loop more resilient to DAG processor crashes. > >> > > > > > > > > > >> > > > > > > > > Shubham > >> > > > > > > > > > >> > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" < > >> > > > > > daniel.imber...@gmail.com > >> > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote: > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > CAUTION: This email originated from outside of the > >> > > organization. > >> > > > Do > >> > > > > > not > >> > > > > > > > > click links or open attachments unless you can confirm > the > >> > > sender > >> > > > > and > >> > > > > > > > know > >> > > > > > > > > the content is safe. > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un > >> > > expéditeur > >> > > > > > > externe. > >> > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce > jointe > >> si > >> > > vous > >> > > > > ne > >> > > > > > > > pouvez > >> > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous > n’êtes > >> > pas > >> > > > > > certain > >> > > > > > > > que > >> > > > > > > > > le contenu ne présente aucun risque. > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > I'm +1 on this. > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > The fact that there's one more thing to deploy isn't > that > >> big > >> > > of > >> > > > an > >> > > > > > > issue > >> > > > > > > > > given the number of pre-configurable options mentioned > >> (e.g. > >> > > > helm) > >> > > > > > and > >> > > > > > > a > >> > > > > > > > > full logical separation of DAG parsing and scheduling > >> makes > >> > > sense > >> > > > > > (one > >> > > > > > > > > thing that has been a longstanding issue with Airflow is > >> the > >> > > > > > scheduler > >> > > > > > > > > "Doing too many things", so it would be nice to create a > >> > clean > >> > > > > divide > >> > > > > > > > > here). > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham > >> > > > > <j...@astronomer.io.inva > >> > > > > > > > > <mailto:j...@astronomer.io.inva>lid> > >> > > > > > > > > wrote: > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > Hello everyone! > >> > > > > > > > > > > >> > > > > > > > > > As I've been working on parsing lately, I want to > >> propose a > >> > > > > change > >> > > > > > in > >> > > > > > > > > that > >> > > > > > > > > > area in time for Airflow 3. > >> > > > > > > > > > > >> > > > > > > > > > Today there are 2 different ways the DAG processor can > >> be > >> > run > >> > > > in > >> > > > > > > > Airflow > >> > > > > > > > > - > >> > > > > > > > > > as a standalone component, or embedded in the > scheduler. > >> > The > >> > > > > > > standalone > >> > > > > > > > > > option came in 2.3, prior to that the only option was > it > >> > > being > >> > > > > > > embedded > >> > > > > > > > > in > >> > > > > > > > > > the scheduler. > >> > > > > > > > > > > >> > > > > > > > > > Why standalone? Generally speaking, parsing scales > >> > vertically > >> > > > > > (single > >> > > > > > > > > loop > >> > > > > > > > > > - more concurrent parsing) while scheduling is scaled > >> > > > > horizontally > >> > > > > > > > (many > >> > > > > > > > > > loops). As the DAG processor and scheduler scale in > >> > different > >> > > > > > > manners, > >> > > > > > > > > it's > >> > > > > > > > > > awkward to have them live in the same component. There > >> is > >> > > also > >> > > > a > >> > > > > > > > > resiliency > >> > > > > > > > > > aspect here, no noisy neighbor issues. > >> > > > > > > > > > > >> > > > > > > > > > Really, the only positive of the embedded option is > that > >> > it's > >> > > > > > easier > >> > > > > > > to > >> > > > > > > > > > deploy, as there is 1 less component to worry about. > >> > However, > >> > > > we > >> > > > > > > > already > >> > > > > > > > > > have a number of components, so 1 more isn't that > >> > cumbersome. > >> > > > > > > Everyone > >> > > > > > > > > > using breeze, standalone, the helm chart, a vendor, > >> won't > >> > be > >> > > > > > impacted > >> > > > > > > > > much > >> > > > > > > > > > by this change - in fact, having the log stream > separate > >> > is a > >> > > > big > >> > > > > > > > > positive! > >> > > > > > > > > > > >> > > > > > > > > > We'd also be able to remove a bit of complexity around > >> > > > > > > reinitialising a > >> > > > > > > > > > bunch of stuff in the child process. > >> > > > > > > > > > > >> > > > > > > > > > Overall, I see primarily positives with this change, > >> and a > >> > > > major > >> > > > > > > > version > >> > > > > > > > > > upgrade is the perfect time to simplify this part of > >> > Airflow. > >> > > > > > > Thoughts? > >> > > > > > > > > > > >> > > > > > > > > > Jed > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >