+1 on this

For many reasons, which have already been brought up in the thread.


On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov <ikholo...@google.com.invalid>
wrote:

> +1, there are a lot of old code paths that exist only to support
> embedding in the scheduler. Focusing on a single supported mode of
> operation will allow us to significantly reduce the size (and complexity)
> of the DAG processing code.
>
> On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <pierrejb...@gmail.com>
> wrote:
>
> > +1
> >
> > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
> > <michalmod...@google.com.invalid> wrote:
> >
> > > +1 - separating these workloads makes sense to me - we remove
> > > unnecessary coupling and make them more single-responsibility, which
> > > eases reasoning about the system and any potential debugging.
> > >
> > >
> > >
> > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > >
> > > > Yeah, purely from an operational perspective, debugging issues becomes
> > > > a lot simpler if they are separated, as one is CPU hungry while the
> > > > other is memory hungry!
> > > >
> > > >
> > > >
> > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > > > +1 (or rather +10). There are two additional things - both related
> > > > > to security and our new security model.
> > > > >
> > > > > 1) A separate standalone DAG processor follows the "secure by
> > > > > design" principle. Having the scheduler and the DAG file processor
> > > > > share the same "process" space is a problem for isolating the DAG
> > > > > Author controlled security perimeter from the scheduler perimeter.
> > > > > Even though they were separate processes, it's just inherently
> > > > > unsafe (from the security model perspective) to have the DAG
> > > > > processor started as a sub-process. And this is not an "academic"
> > > > > case - we had a few security issues reported to us that could only
> > > > > be exploited if Airflow was run in the default mode; the issues were
> > > > > not exploitable when the standalone DAG Processor was used. And that
> > > > > is independent of point 2) regarding the database access - just
> > > > > running in the same process space allows the DAG author to impact
> > > > > running scheduler code in various ways (via temporary files, for
> > > > > example - but there are multiple other scenarios and attack
> > > > > vectors).
> > > > >
> > > > > 2) Currently the DAG processor has very different requirements than
> > > > > the scheduler when it comes to database access. Basically it MUST
> > > > > NOT connect to the Airflow meta-database. We already saw failures
> > > > > yesterday, after merging bundle parsing, that appear to be caused by
> > > > > the connection being set up in the scheduler and the DAG processor
> > > > > being forked from it via multiprocessing - originally we
> > > > > re-initialized the database when we forked the processor, but now
> > > > > the DAG processor MUST NOT use the DB, so basically it looks like we
> > > > > leak the DB to the processor. Which is also yet another security
> > > > > issue - if the scheduler has a way to initialize the database, it
> > > > > means it has access to the database credentials, and it also means
> > > > > that unless we involve some kind of cgroups, docker-like process
> > > > > separation, such a forked DAG processor (and this also means the DAG
> > > > > author) can access those credentials, and access the database. This
> > > > > pretty much means that the embedded DAG processor simply breaks the
> > > > > "no DB access by DAG author" assumptions of Airflow 3.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <kaxiln...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
> > > > > > <shu...@amazon.com.invalid> wrote:
> > > > > >
> > > > > > > +1 on this as well. From what I have seen, standalone DAG
> > > > > > > processing results in a minor performance advantage and,
> > > > > > > importantly, makes the Scheduler loop more resilient to DAG
> > > > > > > processor crashes.
> > > > > > >
> > > > > > > Shubham
> > > > > > >
> > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman"
> > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > CAUTION: This email originated from outside of the
> organization.
> > Do
> > > > not
> > > > > > > click links or open attachments unless you can confirm the
> sender
> > > and
> > > > > > know
> > > > > > > the content is safe.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un
> expéditeur
> > > > > externe.
> > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si
> vous
> > > ne
> > > > > > pouvez
> > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas
> > > > certain
> > > > > > que
> > > > > > > le contenu ne présente aucun risque.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I'm +1 on this.
> > > > > > >
> > > > > > >
> > > > > > > The fact that there's one more thing to deploy isn't that big
> > > > > > > of an issue given the number of pre-configurable options
> > > > > > > mentioned (e.g. helm), and a full logical separation of DAG
> > > > > > > parsing and scheduling makes sense (one thing that has been a
> > > > > > > longstanding issue with Airflow is the scheduler "doing too
> > > > > > > many things", so it would be nice to create a clean divide
> > > > > > > here).
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
> > > > > > > <j...@astronomer.io.invalid> wrote:
> > > > > > >
> > > > > > >
> > > > > > > > Hello everyone!
> > > > > > > >
> > > > > > > > As I've been working on parsing lately, I want to propose a
> > > > > > > > change in that area in time for Airflow 3.
> > > > > > > >
> > > > > > > > Today there are 2 different ways the DAG processor can be run
> > > > > > > > in Airflow - as a standalone component, or embedded in the
> > > > > > > > scheduler. The standalone option came in 2.3; prior to that,
> > > > > > > > the only option was embedding it in the scheduler.
> > > > > > > >
> > > > > > > > Why standalone? Generally speaking, parsing scales vertically
> > > > > > > > (single loop - more concurrent parsing) while scheduling
> > > > > > > > scales horizontally (many loops). As the DAG processor and
> > > > > > > > scheduler scale in different manners, it's awkward to have
> > > > > > > > them live in the same component. There is also a resiliency
> > > > > > > > aspect here: no noisy neighbor issues.
> > > > > > > >
> > > > > > > > Really, the only positive of the embedded option is that it's
> > > > > > > > easier to deploy, as there is 1 less component to worry
> > > > > > > > about. However, we already have a number of components, so 1
> > > > > > > > more isn't that cumbersome. Everyone using breeze,
> > > > > > > > standalone, the helm chart, or a vendor won't be impacted
> > > > > > > > much by this change - in fact, having the log stream separate
> > > > > > > > is a big positive!
> > > > > > > >
> > > > > > > > We'd also be able to remove a bit of complexity around
> > > > > > > > reinitialising a bunch of stuff in the child process.
> > > > > > > >
> > > > > > > > Overall, I see primarily positives with this change, and a
> > > > > > > > major version upgrade is the perfect time to simplify this
> > > > > > > > part of Airflow. Thoughts?
> > > > > > > >
> > > > > > > > Jed
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
