Re: [Discussion] Make the scheduler's task selection algorithm pluggable

Jarek Potiuk Tue, 16 Sep 2025 10:34:00 -0700

The very basic principle here is 'community over code' - we believe
that good code is the result of building a great community that works
together, and that the opposite is not necessarily true.


In this case I see the issue not as 'the code is bad' but as 'let's build
a community of people collaborating on making it better' - where careful
refactoring with the participation of multiple parties, and especially
building on all the experience and accumulated wisdom of the people who
worked on it, is crucial. Individuals who have even the best intentions to
make things 'better' might just miss a lot of context (for example, the
open-source context, where we have to be careful what we expose to users -
a lesson we learned the hard way over the many years we have maintained
Airflow).

And it simply closes the door on cooperation rather than opening it if we
start with a goal that is too ambitious and has a "let's rip it all out,
it's badly designed" vibe to it.

I think it's better to see modularity / customisability as a potentially
good end goal and to start with incremental improvements that will also
address the imperfect explanation of context and decisions, and get people
enthused and collaborating on it.

J.

On Tue, 16 Sep 2025 at 11:13, asquator <[email protected]> wrote:

> Well, it's always new contributors who criticize the old code, because
> they're affected the most.
> It's not about blaming people, but noticing things "at the margin",
> ignoring the history and proposing
> changes that are objectively beneficial *at the moment*. I did watch Ash's
> talk before starting to work
> on the starvation issue, and it's great. I have nothing to say about the
> algorithm and the logic
> (mb except it should be asynchronous), but the code structure. Over the
> years, small patches have been added
> to one file, and now it has exploded. It's normal. It happens. We just
> have to realize that and start
> splitting the code, without drastically changing the logic.
>
> Practice shows that modular code is always better than monolithic scripts
> with multiple responsibilities.
> I don't say abstract every small part of the code. The "cheapest" way to
> decide which parts should
> be abstracted is the simple principle of "if the component has to be
> replaced once, it can be replaced again in the future".
> As we're working on the critical section, we're ready to do the extra work
> to abstract it out and make it
> pluggable (for developers) - it can be the first step in modularizing the
> scheduler which we're ready to take.
>
> As I have to implement the PR, I'm willing to dive into the tactical level
> and reach a compromise with the PMCs
> (who know this project's internals much better) regarding what's the best
> option to inject things internally.
> It's a purely technical problem I have to solve to have a working PR. I
> *will* do the subclass for testing because
> it's quick and dirty, but it would be nice to come to a better solution
> (it shouldn't be too hard) so we can
> do it the best way before it's (hopefully) merged.
>
>
>
> On Tuesday, September 16th, 2025 at 7:08 PM, Jarek Potiuk <
> [email protected]> wrote:
>
> > I think this is something we should seriously discuss after 3.1 as many
> > people are busy now with tying up loose ends.
> >
> > But I would be generally in favour of looking at Scheduler logic,
> > modularising it and making it easier to reason about. I do not want to
> > (absolutely) say that it's bad or "should have been done better", or
> > to say that modular code is always best (which often might be implied
> > as "someone did a bad job here"). This is absolutely not that. It's very
> > easy to criticise things (even subconsciously) when you come from
> > outside and you see how things "should have been" - but without all the
> > context and history, this is often bad messaging to those who spent years
> > on making things work reliably in production for tens of thousands of
> > users and being generally stable and one of the most reliable parts of
> > Airflow "core".
> >
> > There are PLENTY of reasons for the scheduler being implemented the way
> > it is - and even trying to approach explaining the history and decision
> > making process would take a lot of time. And currently (for good reasons)
> > the scheduler "API" is exactly what Ash explained.
> >
> > But there is no reason we should not think about and make a concerted
> > (group) effort to modularise it and make the scheduler easier to reason
> > about and - importantly - easier for more people to contribute to and
> > discuss - and most importantly - add way more modularised tests that
> > would also allow us to break it up and test parts of it - also for
> > performance and behaviour characteristics. It's not well suited for that
> > today, but - possibly - it could be. And as it is the most important part
> > of Airflow, we should make it easier for many people to understand,
> > reason about and contribute to. Right now there are probably just a few
> > people (Ash being the main person) who can reason about and discuss some
> > intrinsic scheduler insights. And if we want to make Airflow sustainable,
> > we should make it easier for others to understand and contribute to it.
> >
> > One of the things I would love as a result of it is to have it better
> > documented, with the reasoning behind some decisions and an explanation
> > of how it works (that might be a result of such a concerted effort).
> >
> > It's a similar story as with CI/CD Breeze two years ago - I was the only
> > one who really could reason about it, but through rewriting it in Python,
> > documenting it with ADRs
> > https://github.com/apache/airflow/tree/main/dev/breeze/doc/adr (which
> > still describe some basic assumptions there), engaging others,
> > modularising stuff and getting them to participate, I can now go on a
> > 3-week vacation knowing that things will be taken care of, no matter what
> > (which, BTW, I am doing now).
> >
> > The "why" and "how" of the scheduler is not really documented. There is
> > this fantastic talk by Ash https://www.youtube.com/watch?v=DYC4-xElccE
> > which still holds and explains it, but I would love to be able to reason
> > and discuss more about it - looking at both code and docs - without
> > reverse-engineering stuff.
> >
> > But I think the goal should be "modularising first" - maybe resulting
> > later in an easier way of replacing pieces of the scheduler, so the
> > modularising effort should be guided by the current PRs and the ways they
> > are trying to address starvation, for example. Doing it slowly, with
> > multiple people reviewing, learning and contributing (and documentation
> > created on the way).
> >
> > I think that should be our initial goal ... then maybe things will
> > follow.
> >
> > J.
> >
> >
> >
> > On Tue, Sep 16, 2025 at 8:02 AM asquator [email protected] wrote:
> >
> > > will be undocumented*
> > >
> > > On Tuesday, September 16th, 2025 at 5:01 PM, asquator [email protected] wrote:
> > >
> > > > I see the motivation, but does it have to look so bad?
> > > >
> > > > The subclass will look like this:
> > > >
> > > > class SchedulerJobRunnerLinearTIScan(SchedulerJobRunner):
> > > >     def __init__(
> > > >         self,
> > > >         job: Job,
> > > >         num_runs: int = conf.getint("scheduler", "num_runs"),
> > > >         scheduler_idle_sleep_time: float = conf.getfloat("scheduler", "scheduler_idle_sleep_time"),
> > > >         log: Logger | None = None,
> > > >     ):
> > > >         super().__init__(
> > > >             job=job,
> > > >             num_runs=num_runs,
> > > >             scheduler_idle_sleep_time=scheduler_idle_sleep_time,
> > > >             log=log,
> > > >         )
> > > >         # Hard-code the selector that the parent class will use.
> > > >         self.task_selector = TASK_SELECTORS[LINEAR_SCAN_SELECTOR]
> > > >
> > > > The super class will use the injected hard-coded task selector.
> > > >
> > > > Can't we introduce a configuration hierarchy like `core.internal` and
> > > > put there things not exposed to the end user? So we don't have to do
> > > > this weird subclassing?
> > > >
> > > > It will look thus:
> > > >
> > > > class SchedulerJobRunner(...):
> > > >     def __init__(self, ...):
> > > >         task_selector_type = conf.get("scheduler.internal", "task_selector_strategy")
> > > >         self.task_selector = TASK_SELECTORS[task_selector_type]
> > > >
> > > > We'd just like to have an internal toggle as an implementation
> > > > detail, which won't be undocumented and custom implementations won't
> > > > be supported. It's just more convenient and straightforward.
> > > >
> > > > Mb there's another way of internal settings management I missed?
> > > >
> > > > On Tuesday, September 16th, 2025 at 11:34 AM, Ash Berlin-Taylor
> > > > [email protected] wrote:
> > > >
> > > > > > On 16 Sep 2025, at 08:58, asquator [email protected] wrote:
> > > > > >
> > > > > > Yes, exposing pluggable features means fixing an API, which is
> > > > > > confining and just hard to do given the current implementation
> > > > >
> > > > > class MyScheduler:
> > > > >     def execute(self):
> > > > >         while True:
> > > > >             ...  # Do whatever you want.
> > > > >
> > > > > `airflow scheduler --impl=my.module.MyScheduler`
> > > > >
> > > > > That is the API.
> > > > >
> > > > > That is as pluggable as we need it to be.
> > > > >
> > > > > Everything can be built on top of that, including, if you want it, a
> > > > > pluggable task selection mechanism.
> > > > >
> > > > > Airflow already has too many config options and ways of tuning
> > > > > behaviour. We need less of them, not more.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
>
>
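For readers following the subclass-versus-internal-setting question quoted
above, here is a minimal sketch of the registry lookup both variants rely on.
Only the names TASK_SELECTORS and LINEAR_SCAN_SELECTOR come from the thread;
QueuedTask, the two selector functions, PRIORITY_SELECTOR and
SchedulerRunnerSketch are hypothetical stand-ins and not Airflow APIs. The
selector name is passed in as a plain argument here, so the sketch takes no
position on whether it should come from a subclass or from a config section.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QueuedTask:
    """Hypothetical stand-in for a schedulable task (not Airflow's TaskInstance)."""

    task_id: str
    priority_weight: int


# A selector receives the candidate tasks and the number of free slots,
# and returns the tasks that should be queued next.
TaskSelector = Callable[[Sequence[QueuedTask], int], list]


def linear_scan_selector(candidates: Sequence[QueuedTask], max_tis: int) -> list:
    """Take candidates in the order given until the slots run out."""
    return list(candidates[:max_tis])


def priority_selector(candidates: Sequence[QueuedTask], max_tis: int) -> list:
    """Take the highest-priority candidates first (ties keep input order)."""
    ranked = sorted(candidates, key=lambda t: t.priority_weight, reverse=True)
    return ranked[:max_tis]


LINEAR_SCAN_SELECTOR = "linear_scan"
PRIORITY_SELECTOR = "priority"  # hypothetical second strategy

# Internal registry: adding a strategy means adding one entry here.
TASK_SELECTORS = {
    LINEAR_SCAN_SELECTOR: linear_scan_selector,
    PRIORITY_SELECTOR: priority_selector,
}


class SchedulerRunnerSketch:
    """Hypothetical runner; only the selector wiring is modelled."""

    def __init__(self, task_selector_type: str = PRIORITY_SELECTOR):
        try:
            self.task_selector = TASK_SELECTORS[task_selector_type]
        except KeyError:
            known = ", ".join(sorted(TASK_SELECTORS))
            raise ValueError(
                f"Unknown task selector {task_selector_type!r}; known: {known}"
            ) from None

    def select(self, candidates: Sequence[QueuedTask], max_tis: int) -> list:
        # The critical section delegates the choice to the injected selector.
        return self.task_selector(candidates, max_tis)
```

With this shape, a test can build SchedulerRunnerSketch(LINEAR_SCAN_SELECTOR)
directly, which is roughly what the quoted test subclass achieves by
overriding the attribute after calling the parent constructor.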

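The `--impl=my.module.MyScheduler` form quoted above implies resolving a
dotted path to a class that has an execute() method. The sketch below shows
one way such a lookup could work using only the standard library. The helper
name load_scheduler_class is hypothetical, and nothing here is taken from
Airflow's actual CLI code; the `--impl` flag is used only as it appears in
the thread.

```python
import importlib


def load_scheduler_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object (hypothetical helper)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(
            f"Expected 'module.ClassName', got {dotted_path!r}"
        )
    # Raises ImportError if the module cannot be imported.
    module = importlib.import_module(module_name)
    # Raises AttributeError if the module has no such attribute.
    cls = getattr(module, class_name)
    # The only contract assumed here is the one from the thread: execute().
    if not callable(getattr(cls, "execute", None)):
        raise TypeError(f"{dotted_path!r} does not define an execute() method")
    return cls
```

A caller would then instantiate the returned class and call execute() on it;
how the instance is constructed is left open, since the thread does not say.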