+1 to this proposal

--
Regards,
Vishnu Chilukoori

On Tue, Jan 14, 2025 at 5:30 PM Oliveira, Niko <oniko...@amazon.com.invalid>
wrote:

> BIG +1 to this proposal!
>
> ________________________________
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Friday, January 10, 2025 8:54:50 AM
> To: dev@airflow.apache.org
> Subject: RE: [EXT] [DISCUSS] Drop support for the DAG processor embedded
> in the scheduler
>
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe.
> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez
> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que
> le contenu ne présente aucun risque.
>
>
>
> With Task API :)
>
> On Fri, Jan 10, 2025 at 5:54 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Dag processor does not access the DB at all - it communicates with Task
> > SDK to write the serialized DAGS.
> >
> > On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov
> <ikholo...@google.com.invalid>
> > wrote:
> >
> >> The only thing that we need to clear out before this is sealed - what
> are
> >> we going to do with SQLite? SQLite supports concurrent connections if
> the
> >> processes are on the same host and we already have WAL enabled by Ash in
> >> https://github.com/apache/airflow/pull/44839/files. But I think we
> still
> >> need to do some additional smoke-testing to confirm that SQLite
> >> performance
> >> doesn't degrade drastically with dag-processor and scheduler writing to
> it
> >> concurrently.
> >>
> >> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka
> <vik...@astronomer.io.invalid
> >> >
> >> wrote:
> >>
> >> > +1 on this
> >> >
> >> > For many reasons, which have already been brought up in the thread.
> >> >
> >> >
> >> > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov
> >> <ikholo...@google.com.invalid
> >> > >
> >> > wrote:
> >> >
> >> > > +1, there are a lot of old code paths that exist only because of the
> >> > > embedding in the scheduler support. Focusing on a single supported
> >> mode
> >> > of
> >> > > operation will allow us to significantly reduce the size (and
> >> complexity)
> >> > > of the DAG processing code.
> >> > >
> >> > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <
> >> pierrejb...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > +1
> >> > > >
> >> > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
> >> > > > <michalmod...@google.com.invalid> wrote:
> >> > > >
> >> > > > > +1 - separating these workloads makes sense to me - we remove
> >> > > > > unnecessary coupling and make them more single-responsibility,
> >> which
> >> > > > eases
> >> > > > > reasoning about the system and any potential debugging
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com
> >
> >> > > wrote:
> >> > > > >
> >> > > > > > Yeah, purely from operational perspective, debugging issues
> >> become
> >> > > lots
> >> > > > > > simpler if they are separated as one is CPU hungry while the
> >> other
> >> > is
> >> > > > > > memory hungry!
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com>
> >> > wrote:
> >> > > > > >
> >> > > > > > > +1 (or rather +10). There are two additional things - both
> >> > related
> >> > > to
> >> > > > > > > security and our new security model.
> >> > > > > > >
> >> > > > > > > 1) separate standalone dag processor follows "secure by
> >> design"
> >> > > > > > principle.
> >> > > > > > > Having scheduler and dag file processor sharing the same
> >> > "process"
> >> > > > > space
> >> > > > > > is
> >> > > > > > > a problem with isolation of DAG Author controlled security
> >> > > perimeter
> >> > > > > and
> >> > > > > > > scheduler perimeter. While they were separate processes,
> it's
> >> > just
> >> > > > > > > inherently unsafe (from the security model perspective) to
> >> have
> >> > the
> >> > > > DAG
> >> > > > > > > processor started as a sub-process. And this is not an
> >> "academic"
> >> > > > case
> >> > > > > -
> >> > > > > > we
> >> > > > > > > had a few security issues reported to us that could be only
> >> > > exploited
> >> > > > > if
> >> > > > > > > airflow was run in the default mode, the issues were not
> >> > > exploitable
> >> > > > > when
> >> > > > > > > standalone DAG Processor was used. And that is independent
> >> from
> >> > > point
> >> > > > > 2)
> >> > > > > > > regarding the database access - just running in the same
> >> process
> >> > > > space
> >> > > > > > > allows DAG author to impact running scheduler code in
> various
> >> > ways
> >> > > > (via
> >> > > > > > > temporary files for example - but there are multiple other
> >> > > scenarios
> >> > > > > and
> >> > > > > > > attack vectors).
> >> > > > > > >
> >> > > > > > > 2) Currently the DAG processor has very different
> requirements
> >> > than
> >> > > > > > > scheduler when it comes to database access. Basically it
> MUST
> >> NOT
> >> > > > > connect
> >> > > > > > > to the Airflow meta-database. We already saw failures
> >> yesterday
> >> > > after
> >> > > > > > > merging bundle parsing that suggest that it was caused by
> >> > > connection
> >> > > > > > being
> >> > > > > > > set in scheduler and DAG processor forked from it via
> >> > > > multiprocessing -
> >> > > > > > > originally we re-initialized database when we forked the
> >> > processor,
> >> > > > but
> >> > > > > > now
> >> > > > > > > DAG processor MUST NOT use the DB, so basically it looks
> like
> >> we
> >> > > leak
> >> > > > > the
> >> > > > > > > DB to the processor. Which is also yet another security
> issue
> >> -
> >> > if
> >> > > > > > > scheduler has a way to initialize the database, it means it
> >> has
> >> > > > access
> >> > > > > to
> >> > > > > > > the database credentials, and it also means that unless we
> >> > involve
> >> > > > some
> >> > > > > > > kind of cgroups docker-like process separation, such forked
> >> DAG
> >> > > > > processor
> >> > > > > > > (and this also means DAG author) can access those
> credentials,
> >> > and
> >> > > > > access
> >> > > > > > > database. This means pretty much that embedded DAG processor
> >> > simply
> >> > > > > > breaks
> >> > > > > > > the "no DB access by DAG author" assumptions of Airflow 3.
> >> > > > > > >
> >> > > > > > > J.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <
> >> kaxiln...@gmail.com>
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > +1
> >> > > > > > > >
> >> > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
> >> > > > > <shu...@amazon.com.invalid
> >> > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > + 1 on this as well. From what I have seen, standalone
> DAG
> >> > > > > processing
> >> > > > > > > > > results in a minor performance advantage and,
> importantly,
> >> > > makes
> >> > > > > the
> >> > > > > > > > > Scheduler loop more resilient to DAG processor crashes.
> >> > > > > > > > >
> >> > > > > > > > > Shubham
> >> > > > > > > > >
> >> > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" <
> >> > > > > > daniel.imber...@gmail.com
> >> > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote:
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > CAUTION: This email originated from outside of the
> >> > > organization.
> >> > > > Do
> >> > > > > > not
> >> > > > > > > > > click links or open attachments unless you can confirm
> the
> >> > > sender
> >> > > > > and
> >> > > > > > > > know
> >> > > > > > > > > the content is safe.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un
> >> > > expéditeur
> >> > > > > > > externe.
> >> > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce
> jointe
> >> si
> >> > > vous
> >> > > > > ne
> >> > > > > > > > pouvez
> >> > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous
> n’êtes
> >> > pas
> >> > > > > > certain
> >> > > > > > > > que
> >> > > > > > > > > le contenu ne présente aucun risque.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I'm +1 on this.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > The fact that there's one more thing to deploy isn't
> that
> >> big
> >> > > of
> >> > > > an
> >> > > > > > > issue
> >> > > > > > > > > given the number of pre-configurable options mentioned
> >> (e.g.
> >> > > > helm)
> >> > > > > > and
> >> > > > > > > a
> >> > > > > > > > > full logical separation of DAG parsing and scheduling
> >> makes
> >> > > sense
> >> > > > > > (one
> >> > > > > > > > > thing that has been a longstanding issue with Airflow is
> >> the
> >> > > > > > scheduler
> >> > > > > > > > > "Doing too many things", so it would be nice to create a
> >> > clean
> >> > > > > divide
> >> > > > > > > > > here).
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
> >> > > > > <j...@astronomer.io.inva
> >> > > > > > > > > <mailto:j...@astronomer.io.inva>lid>
> >> > > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > > Hello everyone!
> >> > > > > > > > > >
> >> > > > > > > > > > As I've been working on parsing lately, I want to
> >> propose a
> >> > > > > change
> >> > > > > > in
> >> > > > > > > > > that
> >> > > > > > > > > > area in time for Airflow 3.
> >> > > > > > > > > >
> >> > > > > > > > > > Today there are 2 different ways the DAG processor can
> >> be
> >> > run
> >> > > > in
> >> > > > > > > > Airflow
> >> > > > > > > > > -
> >> > > > > > > > > > as a standalone component, or embedded in the
> scheduler.
> >> > The
> >> > > > > > > standalone
> >> > > > > > > > > > option came in 2.3, prior to that the only option was
> it
> >> > > being
> >> > > > > > > embedded
> >> > > > > > > > > in
> >> > > > > > > > > > the scheduler.
> >> > > > > > > > > >
> >> > > > > > > > > > Why standalone? Generally speaking, parsing scales
> >> > vertically
> >> > > > > > (single
> >> > > > > > > > > loop
> >> > > > > > > > > > - more concurrent parsing) while scheduling is scaled
> >> > > > > horizontally
> >> > > > > > > > (many
> >> > > > > > > > > > loops). As the DAG processor and scheduler scale in
> >> > different
> >> > > > > > > manners,
> >> > > > > > > > > it's
> >> > > > > > > > > > awkward to have them live in the same component. There
> >> is
> >> > > also
> >> > > > a
> >> > > > > > > > > resiliency
> >> > > > > > > > > > aspect here, no noisy neighbor issues.
> >> > > > > > > > > >
> >> > > > > > > > > > Really, the only positive of the embedded option is
> that
> >> > it's
> >> > > > > > easier
> >> > > > > > > to
> >> > > > > > > > > > deploy, as there is 1 less component to worry about.
> >> > However,
> >> > > > we
> >> > > > > > > > already
> >> > > > > > > > > > have a number of components, so 1 more isn't that
> >> > cumbersome.
> >> > > > > > > Everyone
> >> > > > > > > > > > using breeze, standalone, the helm chart, a vendor,
> >> won't
> >> > be
> >> > > > > > impacted
> >> > > > > > > > > much
> >> > > > > > > > > > by this change - in fact, having the log stream
> separate
> >> > is a
> >> > > > big
> >> > > > > > > > > positive!
> >> > > > > > > > > >
> >> > > > > > > > > > We'd also be able to remove a bit of complexity around
> >> > > > > > > reinitialising a
> >> > > > > > > > > > bunch of stuff in the child process.
> >> > > > > > > > > >
> >> > > > > > > > > > Overall, I see primarily positives with this change,
> >> and a
> >> > > > major
> >> > > > > > > > version
> >> > > > > > > > > > upgrade is the perfect time to simplify this part of
> >> > Airflow.
> >> > > > > > > Thoughts?
> >> > > > > > > > > >
> >> > > > > > > > > > Jed
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Reply via email to