I think this is something we should seriously discuss after 3.1, as many
people are busy now tying up loose ends.

But I would generally be in favour of looking at the Scheduler logic,
modularising it, and making it easier to reason about. I absolutely do not
want to say that it's bad or "should have been done better", or to imply
that modular code is always best (which can easily come across as "someone
did a bad job here"). This is absolutely not that. It's very easy to
criticise things (even subconsciously) when you come in from outside and
see how things "should have been" - but without all the context and
history, that sends a bad message to those who spent years making things
work reliably in production for tens of thousands of users, keeping the
scheduler generally stable and one of the most reliable parts of Airflow
"core".

There are PLENTY of reasons the scheduler is implemented the way it is -
and even attempting to explain the history and the decision-making process
would take a lot of time. And currently (for good reasons) the scheduler
"API" is exactly what Ash explained.

But there is no reason we should not make a concerted group effort to
modularise it and make the scheduler easier to reason about and -
importantly - easier for more people to contribute to and discuss. Most
importantly, we should add far more modularised tests that would let us
break it up and test parts of it in isolation - also for performance and
behavioural characteristics. It's not well suited for that today, but
possibly it could be. As the **most** important part of Airflow, we should
make it easier to understand, reason about, and contribute to by many
people. Right now there are probably just a few people (Ash being the main
one) who can reason about and discuss some of the scheduler's intrinsic
details. If we want to make Airflow sustainable, we should make it easier
for others to understand and contribute to it.

One result I would love to see from this is better documentation: an
explanation of how the scheduler works and of the reasoning behind some of
the decisions (that might well come out of such a concerted effort).

It's a similar story to CI/CD and Breeze two years ago - I was the only
one who **really** could reason about it. But through rewriting it in
Python, documenting it with ADRs
(https://github.com/apache/airflow/tree/main/dev/breeze/doc/adr - which
still describe some of the basic assumptions), engaging others,
modularising things, and getting people to participate, I can now go on a
three-week vacation knowing that things will be taken care of, no matter
what (which, BTW, I am doing now).

The "why" and "how" of the scheduler is not really documented. There is
this fantastic talk by Ash https://www.youtube.com/watch?v=DYC4-xElccE
which still holds and explains it, but I would love to be able to reason
about and discuss it more - looking at both code and docs - without
reverse-engineering things.

But I think the goal should be "modularising first" - *maybe* later
resulting in an easier way of replacing pieces of the scheduler. The
modularising effort should be guided by the current PRs and the problems
they are trying to address - starvation, for example. Doing it slowly, with
multiple people reviewing, learning, and contributing (and documentation
created along the way).
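To make the idea a bit more concrete: one possible shape for such a
modular seam - purely a hedged sketch, where all the names (TaskSelector,
FifoSelector, RoundRobinSelector, pick) are hypothetical illustrations and
not Airflow's actual internals - could be a small strategy registry along
the lines of the TASK_SELECTORS idea mentioned later in this thread:

```python
from typing import Protocol


class TaskSelector(Protocol):
    """Strategy interface: decide which candidate task instances to examine next."""

    def select(self, candidates: list[str], limit: int) -> list[str]: ...


class FifoSelector:
    """Always scan from the front of the queue (oldest first)."""

    def select(self, candidates: list[str], limit: int) -> list[str]:
        return candidates[:limit]


class RoundRobinSelector:
    """Rotate the starting point between calls so the tail of the queue
    is not starved by a fixed scan order."""

    def __init__(self) -> None:
        self._offset = 0

    def select(self, candidates: list[str], limit: int) -> list[str]:
        if not candidates:
            return []
        start = self._offset % len(candidates)
        rotated = candidates[start:] + candidates[:start]
        self._offset += limit
        return rotated[:limit]


# A registry keyed by name, echoing the TASK_SELECTORS idea from the thread.
TASK_SELECTORS: dict[str, TaskSelector] = {
    "fifo": FifoSelector(),
    "round_robin": RoundRobinSelector(),
}


def pick(strategy: str, candidates: list[str], limit: int) -> list[str]:
    """Look up a selector by name and apply it to the candidate list."""
    return TASK_SELECTORS[strategy].select(candidates, limit)
```

The point of a seam like this is that each strategy can be unit-tested in
isolation (e.g. asserting that round-robin eventually reaches the tail of
the queue), completely independently of the scheduler loop.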

I think *that* should be our initial goal ... then **maybe** things will
follow.

J.



On Tue, Sep 16, 2025 at 8:02 AM asquator <asqua...@proton.me> wrote:

> will be undocumented*
>
> On Tuesday, September 16th, 2025 at 5:01 PM, asquator <asqua...@proton.me>
> wrote:
>
> > I see the motivation, but does it have to look so bad?
> >
> > The subclass will look like this:
> >
> > class SchedulerJobRunnerLinearTIScan(SchedulerJobRunner):
> >     def __init__(
> >         self,
> >         job: Job,
> >         num_runs: int = conf.getint("scheduler", "num_runs"),
> >         scheduler_idle_sleep_time: float = conf.getfloat("scheduler", "scheduler_idle_sleep_time"),
> >         log: Logger | None = None,
> >     ):
> >         super().__init__(
> >             job=job,
> >             num_runs=num_runs,
> >             scheduler_idle_sleep_time=scheduler_idle_sleep_time,
> >             log=log,
> >         )
> >         self.task_selector = TASK_SELECTORS[LINEAR_SCAN_SELECTOR]
> >
> > The super class will use the injected hard-coded task selector.
> >
> > Can't we introduce a configuration hierarchy like `core.internal` and
> put there things not exposed to the end user? So we don't have to do this
> weird subclassing?
> >
> > It will look thus:
> >
> > class SchedulerJobRunner(...):
> >     task_selector_type = conf.get("scheduler.internal", "task_selector_strategy")
> >     self.task_selector = TASK_SELECTORS[task_selector_type]
> >
> > We'd just like to have an internal toggle as an implementation detail,
> which will be undocumented, and custom implementations won't be supported.
> It's just more convenient and straightforward.
> >
> > Maybe there's another way of internal settings management I missed?
> >
> >
> > On Tuesday, September 16th, 2025 at 11:34 AM, Ash Berlin-Taylor
> a...@apache.org wrote:
> >
> > > > On 16 Sep 2025, at 08:58, asquator asqua...@proton.me wrote:
> > > >
> > > > Yes, exposing pluggable features means fixing an API, which is
> confining and just hard to do given the current implementation
> > >
> > > class MyScheduler:
> > >     def execute(self):
> > >         while True:
> > >             # Do whatever you want.
> > >             ...
> > >
> > > `airflow scheduler --impl=my.module.MyScheduler`
> > >
> > > That is the API.
> > >
> > > That is as pluggable as we need it to be.
> > >
> > > Everything can be built on top of that, including if you want it, a
> pluggable task selection mechanisms.
> > >
> > > Airflow already has too many config options and ways of tuning
> behaviour. We need less of them, not more.
>
