With Task API :)

On Fri, Jan 10, 2025 at 5:54 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Dag processor does not access the DB at all - it communicates with Task
> SDK to write the serialized DAGS.
>
> On Fri, Jan 10, 2025 at 4:51 PM Igor Kholopov <ikholo...@google.com.invalid>
> wrote:
>
>> The only thing that we need to clear out before this is sealed - what are
>> we going to do with SQLite? SQLite supports concurrent connections if the
>> processes are on the same host and we already have WAL enabled by Ash in
>> https://github.com/apache/airflow/pull/44839/files. But I think we still
>> need to do some additional smoke-testing to confirm that SQLite
>> performance
>> doesn't degrade drastically with dag-processor and scheduler writing to it
>> concurrently.
>>
>> On Fri, Jan 10, 2025 at 4:33 PM Vikram Koka <vik...@astronomer.io.invalid
>> >
>> wrote:
>>
>> > +1 on this
>> >
>> > For many reasons, which have already been brought up in the thread.
>> >
>> >
>> > On Fri, Jan 10, 2025 at 7:30 AM Igor Kholopov
>> <ikholo...@google.com.invalid
>> > >
>> > wrote:
>> >
>> > > +1, there are a lot of old code paths that exist only because of the
>> > > embedding in the scheduler support. Focusing on a single supported
>> mode
>> > of
>> > > operation will allow us to significantly reduce the size (and
>> complexity)
>> > > of the DAG processing code.
>> > >
>> > > On Fri, Jan 10, 2025 at 11:49 AM Pierre Jeambrun <
>> pierrejb...@gmail.com>
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > On Fri, Jan 10, 2025 at 11:01 AM Michał Modras
>> > > > <michalmod...@google.com.invalid> wrote:
>> > > >
>> > > > > +1 - separating these workloads makes sense to me - we remove
>> > > > > unnecessary coupling and make them more single-responsibility,
>> which
>> > > > eases
>> > > > > reasoning about the system and any potential debugging
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Fri, Jan 10, 2025 at 9:15 AM Kaxil Naik <kaxiln...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Yeah, purely from operational perspective, debugging issues
>> become
>> > > lots
>> > > > > > simpler if they are separated as one is CPU hungry while the
>> other
>> > is
>> > > > > > memory hungry!
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Fri, 10 Jan 2025 at 13:09, Jarek Potiuk <ja...@potiuk.com>
>> > wrote:
>> > > > > >
>> > > > > > > +1 (or rather +10). There are two additional things - both
>> > related
>> > > to
>> > > > > > > security and our new security model.
>> > > > > > >
>> > > > > > > 1) separate standalone dag processor follows "secure by
>> design"
>> > > > > > principle.
>> > > > > > > Having scheduler and dag file processor sharing the same
>> > "process"
>> > > > > space
>> > > > > > is
>> > > > > > > a problem with isolation of DAG Author controlled security
>> > > perimeter
>> > > > > and
>> > > > > > > scheduler perimeter. While they were separate processes, it's
>> > just
>> > > > > > > inherently unsafe (from the security model perspective) to
>> have
>> > the
>> > > > DAG
>> > > > > > > processor started as a sub-process. And this is not an
>> "academic"
>> > > > case
>> > > > > -
>> > > > > > we
>> > > > > > > had a few security issues reported to us that could be only
>> > > exploited
>> > > > > if
>> > > > > > > airflow was run in the default mode, the issues were not
>> > > exploitable
>> > > > > when
>> > > > > > > standalone DAG Processor was used. And that is independent
>> from
>> > > point
>> > > > > 2)
>> > > > > > > regarding the database access - just running in the same
>> process
>> > > > space
>> > > > > > > allows DAG author to impact running scheduler code in various
>> > ways
>> > > > (via
>> > > > > > > temporary files for example - but there are multiple other
>> > > scenarios
>> > > > > and
>> > > > > > > attack vectors).
>> > > > > > >
>> > > > > > > 2) Currently the DAG processor has very different requirements
>> > than
>> > > > > > > scheduler when it comes to database access. Basically it MUST
>> NOT
>> > > > > connect
>> > > > > > > to the Airflow meta-database. We already saw failures
>> yesterday
>> > > after
>> > > > > > > merging bundle parsing that suggest that it was caused by
>> > > connection
>> > > > > > being
>> > > > > > > set in scheduler and DAG processor forked from it via
>> > > > multiprocessing -
>> > > > > > > originally we re-initialized database when we forked the
>> > processor,
>> > > > but
>> > > > > > now
>> > > > > > > DAG processor MUST NOT use the DB, so basically it looks like
>> we
>> > > leak
>> > > > > the
>> > > > > > > DB to the processor. Which is also yet another security issue
>> -
>> > if
>> > > > > > > scheduler has a way to initialize the database, it means it
>> has
>> > > > access
>> > > > > to
>> > > > > > > the database credentials, and it also means that unless we
>> > involve
>> > > > some
>> > > > > > > kind of cgroups docker-like process separation, such forked
>> DAG
>> > > > > processor
>> > > > > > > (and this also means DAG author) can access those credentials,
>> > and
>> > > > > access
>> > > > > > > database. This means pretty much that embedded DAG processor
>> > simply
>> > > > > > breaks
>> > > > > > > the "no DB access by DAG author" assumptions of Airflow 3.
>> > > > > > >
>> > > > > > > J.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Fri, Jan 10, 2025 at 8:03 AM Kaxil Naik <
>> kaxiln...@gmail.com>
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > +1
>> > > > > > > >
>> > > > > > > > On Fri, 10 Jan 2025 at 07:43, Mehta, Shubham
>> > > > > <shu...@amazon.com.invalid
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > + 1 on this as well. From what I have seen, standalone DAG
>> > > > > processing
>> > > > > > > > > results in a minor performance advantage and, importantly,
>> > > makes
>> > > > > the
>> > > > > > > > > Scheduler loop more resilient to DAG processor crashes.
>> > > > > > > > >
>> > > > > > > > > Shubham
>> > > > > > > > >
>> > > > > > > > > On 2025-01-09, 4:02 PM, "Daniel Imberman" <
>> > > > > > daniel.imber...@gmail.com
>> > > > > > > > > <mailto:daniel.imber...@gmail.com>> wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > CAUTION: This email originated from outside of the
>> > > organization.
>> > > > Do
>> > > > > > not
>> > > > > > > > > click links or open attachments unless you can confirm the
>> > > sender
>> > > > > and
>> > > > > > > > know
>> > > > > > > > > the content is safe.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un
>> > > expéditeur
>> > > > > > > externe.
>> > > > > > > > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe
>> si
>> > > vous
>> > > > > ne
>> > > > > > > > pouvez
>> > > > > > > > > pas confirmer l’identité de l’expéditeur et si vous n’êtes
>> > pas
>> > > > > > certain
>> > > > > > > > que
>> > > > > > > > > le contenu ne présente aucun risque.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm +1 on this.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > The fact that there's one more thing to deploy isn't that
>> big
>> > > of
>> > > > an
>> > > > > > > issue
>> > > > > > > > > given the number of pre-configurable options mentioned
>> (e.g.
>> > > > helm)
>> > > > > > and
>> > > > > > > a
>> > > > > > > > > full logical separation of DAG parsing and scheduling
>> makes
>> > > sense
>> > > > > > (one
>> > > > > > > > > thing that has been a longstanding issue with Airflow is
>> the
>> > > > > > scheduler
>> > > > > > > > > "Doing too many things", so it would be nice to create a
>> > clean
>> > > > > divide
>> > > > > > > > > here).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Jan 9, 2025 at 3:28 PM Jed Cunningham
>> > > > > <j...@astronomer.io.inva
>> > > > > > > > > <mailto:j...@astronomer.io.inva>lid>
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Hello everyone!
>> > > > > > > > > >
>> > > > > > > > > > As I've been working on parsing lately, I want to
>> propose a
>> > > > > change
>> > > > > > in
>> > > > > > > > > that
>> > > > > > > > > > area in time for Airflow 3.
>> > > > > > > > > >
>> > > > > > > > > > Today there are 2 different ways the DAG processor can
>> be
>> > run
>> > > > in
>> > > > > > > > Airflow
>> > > > > > > > > -
>> > > > > > > > > > as a standalone component, or embedded in the scheduler.
>> > The
>> > > > > > > standalone
>> > > > > > > > > > option came in 2.3, prior to that the only option was it
>> > > being
>> > > > > > > embedded
>> > > > > > > > > in
>> > > > > > > > > > the scheduler.
>> > > > > > > > > >
>> > > > > > > > > > Why standalone? Generally speaking, parsing scales
>> > vertically
>> > > > > > (single
>> > > > > > > > > loop
>> > > > > > > > > > - more concurrent parsing) while scheduling is scaled
>> > > > > horizontally
>> > > > > > > > (many
>> > > > > > > > > > loops). As the DAG processor and scheduler scale in
>> > different
>> > > > > > > manners,
>> > > > > > > > > it's
>> > > > > > > > > > awkward to have them live in the same component. There
>> is
>> > > also
>> > > > a
>> > > > > > > > > resiliency
>> > > > > > > > > > aspect here, no noisy neighbor issues.
>> > > > > > > > > >
>> > > > > > > > > > Really, the only positive of the embedded option is that
>> > it's
>> > > > > > easier
>> > > > > > > to
>> > > > > > > > > > deploy, as there is 1 less component to worry about.
>> > However,
>> > > > we
>> > > > > > > > already
>> > > > > > > > > > have a number of components, so 1 more isn't that
>> > cumbersome.
>> > > > > > > Everyone
>> > > > > > > > > > using breeze, standalone, the helm chart, a vendor,
>> won't
>> > be
>> > > > > > impacted
>> > > > > > > > > much
>> > > > > > > > > > by this change - in fact, having the log stream separate
>> > is a
>> > > > big
>> > > > > > > > > positive!
>> > > > > > > > > >
>> > > > > > > > > > We'd also be able to remove a bit of complexity around
>> > > > > > > reinitialising a
>> > > > > > > > > > bunch of stuff in the child process.
>> > > > > > > > > >
>> > > > > > > > > > Overall, I see primarily positives with this change,
>> and a
>> > > > major
>> > > > > > > > version
>> > > > > > > > > > upgrade is the perfect time to simplify this part of
>> > Airflow.
>> > > > > > > Thoughts?
>> > > > > > > > > >
>> > > > > > > > > > Jed
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to