With Task API :) On Fri, Jan 10, 2025 at 5:54 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> Dag processor does not access the DB at all - it communicates with Task > SDK to write the serialized DAGS. > > On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov <ikholo...@google.com.invalid> > wrote: > >> The only thing that we need to clear out before this is sealed - what are >> we going to do with SQLite? SQLite supports concurrent connections if the >> processes are on the same host and we already have WAL enabled by Ash in >> https://github.com/apache/airflow/pull/44839/files. But I think we still >> need to do some additional smoke-testing to confirm that SQLite >> performance >> doesn't degrade drastically with dag-processor and scheduler writing to it >> concurrently. >> >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <vik...@astronomer.io.invalid >> > >> wrote: >> >> > +1 on this >> > >> > For many reasons, which have already been brought up in the thread. >> > >> > >> > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov >> <ikholo...@google.com.invalid >> > > >> > wrote: >> > >> > > +1, there are a lot of old code paths that exist only because of the >> > > embedding in the scheduler support. Focusing on a single supported >> mode >> > of >> > > operation will allow us to significantly reduce the size (and >> complexity) >> > > of the DAG processing code. >> > > >> > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun < >> pierrejb...@gmail.com> >> > > wrote: >> > > >> > > > +1 >> > > > >> > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras >> > > > <michalmod...@google.com.invalid> wrote: >> > > > >> > > > > +1 - separating these workloads makes sense to me - we remove >> > > > > unnecessary coupling and make them more single-responsibility, >> which >> > > > eases >> > > > > reasoning about the system and any potential debugging >> > > > > >> > > > > >> > > > > >> > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> >> > > wrote: >> > > > > >> > > > > > Yeah, purely from operational perspective, debugging issues >> become >> > > lots >> > > > > > simpler if they are separated as one is CPU hungry while the >> other >> > is >> > > > > > memory hungry! >> > > > > > >> > > > > > >> > > > > > >> > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> >> > wrote: >> > > > > > >> > > > > > > +1 (or rather +10). There are two additional things - both >> > related >> > > to >> > > > > > > security and our new security model. >> > > > > > > >> > > > > > > 1) separate standalone dag processor follows "secure by >> design" >> > > > > > principle. >> > > > > > > Having scheduler and dag file processor sharing the same >> > "process" >> > > > > space >> > > > > > is >> > > > > > > a problem with isolation of DAG Author controlled security >> > > perimeter >> > > > > and >> > > > > > > scheduler perimeter. While they were separate processes, it's >> > just >> > > > > > > inherently unsafe (from the security model perspective) to >> have >> > the >> > > > DAG >> > > > > > > processor started as a sub-process. And this is not an >> "academic" >> > > > case >> > > > > - >> > > > > > we >> > > > > > > had a few security issues reported to us that could be only >> > > exploited >> > > > > if >> > > > > > > airflow was run in the default mode, the issues were not >> > > exploitable >> > > > > when >> > > > > > > standalone DAG Processor was used. And that is independent >> from >> > > point >> > > > > 2) >> > > > > > > regarding the database access - just running in the same >> process >> > > > space >> > > > > > > allows DAG author to impact running scheduler code in various >> > ways >> > > > (via >> > > > > > > temporary files for example - but there are multiple other >> > > scenarios >> > > > > and >> > > > > > > attack vectors). >> > > > > > > >> > > > > > > 2) Currently the DAG processor has very different requirements >> > than >> > > > > > > scheduler when it comes to database access. Basically it MUST >> NOT >> > > > > connect >> > > > > > > to the Airflow meta-database. We already saw failures >> yesterday >> > > after >> > > > > > > merging bundle parsing that suggest that it was caused by >> > > connection >> > > > > > being >> > > > > > > set in scheduler and DAG processor forked from it via >> > > > multiprocessing - >> > > > > > > originally we re-initialized database when we forked the >> > processor, >> > > > but >> > > > > > now >> > > > > > > DAG processor MUST NOT use the DB, so basically it looks like >> we >> > > leak >> > > > > the >> > > > > > > DB to the processor. Which is also yet another security issue >> - >> > if >> > > > > > > scheduler has a way to initialize the database, it means it >> has >> > > > access >> > > > > to >> > > > > > > the database credentials, and it also means that unless we >> > involve >> > > > some >> > > > > > > kind of cgroups docker-like process separation, such forked >> DAG >> > > > > processor >> > > > > > > (and this also means DAG author) can access those credentials, >> > and >> > > > > access >> > > > > > > database. This means pretty much that embedded DAG processor >> > simply >> > > > > > breaks >> > > > > > > the "no DB access by DAG author" assumptions of Airflow 3. >> > > > > > > >> > > > > > > J. >> > > > > > > >> > > > > > > >> > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik < >> kaxiln...@gmail.com> >> > > > > wrote: >> > > > > > > >> > > > > > > > +1 >> > > > > > > > >> > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham >> > > > > <shu...@amazon.com.invalid >> > > > > > > >> > > > > > > > wrote: >> > > > > > > > >> > > > > > > > > + 1 on this as well. From what I have seen, standalone DAG >> > > > > processing >> > > > > > > > > results in a minor performance advantage and, importantly, >> > > makes >> > > > > the >> > > > > > > > > Scheduler loop more resilient to DAG processor crashes. >> > > > > > > > > >> > > > > > > > > Shubham >> > > > > > > > > >> > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" < >> > > > > > daniel.imber...@gmail.com >> > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote: >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > CAUTION: This email originated from outside of the >> > > organization. >> > > > Do >> > > > > > not >> > > > > > > > > click links or open attachments unless you can confirm the >> > > sender >> > > > > and >> > > > > > > > know >> > > > > > > > > the content is safe. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un >> > > expéditeur >> > > > > > > externe. >> > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe >> si >> > > vous >> > > > > ne >> > > > > > > > pouvez >> > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes >> > pas >> > > > > > certain >> > > > > > > > que >> > > > > > > > > le contenu ne présente aucun risque. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > I'm +1 on this. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > The fact that there's one more thing to deploy isn't that >> big >> > > of >> > > > an >> > > > > > > issue >> > > > > > > > > given the number of pre-configurable options mentioned >> (e.g. >> > > > helm) >> > > > > > and >> > > > > > > a >> > > > > > > > > full logical separation of DAG parsing and scheduling >> makes >> > > sense >> > > > > > (one >> > > > > > > > > thing that has been a longstanding issue with Airflow is >> the >> > > > > > scheduler >> > > > > > > > > "Doing too many things", so it would be nice to create a >> > clean >> > > > > divide >> > > > > > > > > here). >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham >> > > > > <j...@astronomer.io.inva >> > > > > > > > > <mailto:j...@astronomer.io.inva>lid> >> > > > > > > > > wrote: >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > Hello everyone! >> > > > > > > > > > >> > > > > > > > > > As I've been working on parsing lately, I want to >> propose a >> > > > > change >> > > > > > in >> > > > > > > > > that >> > > > > > > > > > area in time for Airflow 3. >> > > > > > > > > > >> > > > > > > > > > Today there are 2 different ways the DAG processor can >> be >> > run >> > > > in >> > > > > > > > Airflow >> > > > > > > > > - >> > > > > > > > > > as a standalone component, or embedded in the scheduler. >> > The >> > > > > > > standalone >> > > > > > > > > > option came in 2.3, prior to that the only option was it >> > > being >> > > > > > > embedded >> > > > > > > > > in >> > > > > > > > > > the scheduler. >> > > > > > > > > > >> > > > > > > > > > Why standalone? Generally speaking, parsing scales >> > vertically >> > > > > > (single >> > > > > > > > > loop >> > > > > > > > > > - more concurrent parsing) while scheduling is scaled >> > > > > horizontally >> > > > > > > > (many >> > > > > > > > > > loops). As the DAG processor and scheduler scale in >> > different >> > > > > > > manners, >> > > > > > > > > it's >> > > > > > > > > > awkward to have them live in the same component. There >> is >> > > also >> > > > a >> > > > > > > > > resiliency >> > > > > > > > > > aspect here, no noisy neighbor issues. >> > > > > > > > > > >> > > > > > > > > > Really, the only positive of the embedded option is that >> > it's >> > > > > > easier >> > > > > > > to >> > > > > > > > > > deploy, as there is 1 less component to worry about. >> > However, >> > > > we >> > > > > > > > already >> > > > > > > > > > have a number of components, so 1 more isn't that >> > cumbersome. >> > > > > > > Everyone >> > > > > > > > > > using breeze, standalone, the helm chart, a vendor, >> won't >> > be >> > > > > > impacted >> > > > > > > > > much >> > > > > > > > > > by this change - in fact, having the log stream separate >> > is a >> > > > big >> > > > > > > > > positive! >> > > > > > > > > > >> > > > > > > > > > We'd also be able to remove a bit of complexity around >> > > > > > > reinitialising a >> > > > > > > > > > bunch of stuff in the child process. >> > > > > > > > > > >> > > > > > > > > > Overall, I see primarily positives with this change, >> and a >> > > > major >> > > > > > > > version >> > > > > > > > > > upgrade is the perfect time to simplify this part of >> > Airflow. >> > > > > > > Thoughts? >> > > > > > > > > > >> > > > > > > > > > Jed >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> >