Re: [Discussion] Make the scheduler's task selection algorithm pluggable

asquator Sat, 20 Sep 2025 19:08:16 -0700

Re Daniel:
Pushing the logic to sprocs is a brave venture indeed. The idea came to me as a 
last resort after a long frustration with different techniques. We do it 
because we've seemingly tried everything else, and _failed miserably_. Ofc 
there is a big maintenance question, and I'll reserve my detailed comment 
before I do more improvements. It IS a tougher sell, but compared to our 
previous attempt with WFs I see three improvements here:
1. Maintainability (adding new limits)
2. Performance
3. Solved cases
I will notify on the main mailing list devoted to starvation when it's solid 
and ready for testing. The PR is already open for reviews.


Re Jarek:
Thanks for the direction, we'll do the best to follow the path of incremental 
changes and hopefully arrive at a *neat* solution in the end.


On Tuesday, September 16th, 2025 at 8:42 PM, Daniel Standish via dev 
<[email protected]> wrote:

> I agree with Jarek's comments.
> 
> Re some of your comments...
> 
> Scheduler can be improved. Scheduler should be improved. Incremental is
> good.
> 
> Nobody is going to dispute that breaking things up into components that are
> easier to understand and maintain is a good goal.
> 
> Nobody is going to object to running more stuff in parallel, if it is
> feasible and produces better performance, is a good thing.
> 
> Re your specific PR
> 
> I think that your idea to push the logic down to stored procs though is a
> tougher sell. For one, you need to add a different sproc per dialect. And
> the length of the one sproc you wrote is about the same as the long
> function that you are trying to replace. SQL is also a worse language for
> intelligibility and modularization so it goes in the opposite direction.
> Your sproc is tougher to inspect than the function it replaces IMHO.
> 
> We obviously cannot write off the idea of using sprocs and pushing more
> logic to sql. But there are real tradeoffs and the case has to be made
> that it's worth it. The sproc approach does appear to be worse to
> maintain, tougher to test, and more error prone. And to make changes
> requires a deployment --- not just patching code. So, while a case could
> be made that it is worth it, the case has to be made.
> 
> 
> On Tue, Sep 16, 2025 at 10:33 AM Jarek Potiuk [email protected] wrote:
> 
> > It's just the very basic thing here is 'community over code' - wy believe
> > that good code is result of building great community that works together.
> > And that the opposite is not necessarily true.
> > 
> > In this case I identify the issue not as 'code is bad' but as `let's build
> > community of people collaborating on making it better' - where careful
> > refactoring with participation od multiple parties and especially building
> > on all the experience and accumulated years of wisdom of people who worked
> > on it is crucial. Because individuals who have even best intentions to make
> > things 'better' might just miss a lot of context (like for example the
> > open-source context where we have to be careful what we expose to users -
> > lessons we learned hard over the many years Airflow is maintained by us.
> > 
> > And it is simply rather closing for cooperation rather than opening if we
> > start with goal that is too ambitious and starts with "let's rip it all out
> > it's badly designed" vibe to it.
> > 
> > I think it's better to see the modularity / customisability as potentially
> > good end goal and start with incremental improvements that will also
> > address the imperfect explanation of context and decisions and get people
> > enthused and collaborating on it.
> > 
> > J.
> > 
> > wt., 16 wrz 2025, 11:13 użytkownik asquator [email protected] napisał:
> > 
> > > Well, it's always new contributors who criticize the old code, because
> > > they're affected the most.
> > > It's not about blaming people, but noticing things "at the margin",
> > > ignoring the history and proposing
> > > changes that are objectively beneficial at the moment. I did watch
> > > Ash's
> > > talk before starting to work
> > > on the starvation issue, and it's great. I have nothing to say about the
> > > algorithm and the logic
> > > (mb except it should be asynchronous), but the code structure. Over the
> > > years, small patches have been added
> > > to one file, and now it has exploded. It's normal. It happens. We just
> > > have to realize that and start
> > > splitting the code, without drastically changing the logic.
> > > 
> > > Practice shows that modular code is always better than monolithic scripts
> > > with multiple responsibilities.
> > > I don't say abstract every small part of the code. The "cheapest" way to
> > > decide which parts should
> > > be abstracted is the simple principle of "if the component has to be
> > > replaced once, it can be replaced again in the future".
> > > As we're working on the critical section, we're ready to do the extra
> > > work
> > > to abstract it out and make it
> > > pluggable (for developers) - it can be the first step in modularizing the
> > > scheduler which we're ready to take.
> > > 
> > > As I have to implement the PR, I'm willing to dive into the tactical
> > > level
> > > and reach a compromise with the PMCs
> > > (who know this project's internals much better) regarding what's the best
> > > option to inject things internally.
> > > It's a purely technical problem I have to solve to have a working PR. I
> > > will do the subclass for testing because
> > > it's quick and dirty, but it would be nice to come to a better solution
> > > (it shouldn't be too hard) so we can
> > > do it the best way before it's (hopefully) merged.
> > > 
> > > On Tuesday, September 16th, 2025 at 7:08 PM, Jarek Potiuk <
> > > [email protected]> wrote:
> > > 
> > > > I think this is something we should seriously discuss after 3.1 as many
> > > > people are busy now with tying up loose ends.
> > > > 
> > > > But I would be generally in favour of looking at Scheduler logic,
> > > > modularising it and making it easier to reason about. I do not want to
> > > > (absolutely) say that it's bad or "should have been done better" or
> > > > anything to say always modular code is best (which often might be
> > > > implied
> > > > as "someone did a bad job here"). This is absolutely not that. It's
> > > > very
> > > > easy to criticise things (even subconsciously) when you come out from
> > > > outside and you see how things "should have been" - but without all the
> > > > context and history, this is often a bad messaging to those who spent
> > > > years
> > > > on making thing work reliably in production for tens of thousands of
> > > > users
> > > > and being generally stable and one of the most reliable part of Airflow
> > > > "core".
> > > > 
> > > > There are PLENTY of reasons for scheduler being implemented the way it
> > > > is -
> > > > and even trying to approach explaining the history and decision making
> > > > process would take a lot of time. And currently (for good reasons) the
> > > > scheduler "API" is exactly what Ash explained.
> > > > 
> > > > But there is no reason we should not think and make a concerted (and
> > > > group)
> > > > effort to modularise it and make scheduler easier to reason about and -
> > > > importantly - being easier for more people to contribute to and discuss
> > > > it
> > > > - and most importantly - adding way more modularised tests that would
> > > > also
> > > > allow us to break it up and tests parts of it - also for performance
> > > > and
> > > > behaviour characteristics. It's not well suited for it today, but -
> > > > possibly - it could. And as the most important part of Airflow, we
> > > > should make it easier to understand, reason about and contribute to by
> > > > many
> > > > people. Simply now there are probably just a few people (Ash being the
> > > > main
> > > > person) that can reason and discuss some intrinsic scheduler insights.
> > > > And
> > > > If we want to make Airflow sustainable, we should make it easier for
> > > > others
> > > > to understand and contribute to it.
> > > > 
> > > > One of the things as a result of it - I would love to have it better
> > > > documented and explained the reasoning behind some decisions and
> > > > explaining
> > > > how it works (that might be a result of such a concerted effort).
> > > > 
> > > > It's a similar story as with Ci/CD Breeze two years ago - I was the
> > > > only
> > > > one who really could reason about it but through rewriting in Python
> > > > and documenting with ADRs
> > > > https://github.com/apache/airflow/tree/main/dev/breeze/doc/adr which
> > > > still
> > > > describe some basic assumptions there and engaging others, modularising
> > > > stuff and getting them to participate I can go now for 3 weeks
> > > > vacations
> > > > knowing that things will be taken care of, no matter what (Which BTW. I
> > > > am
> > > > doing now).
> > > > 
> > > > The "why" and "how" scheduler works is not really documented. There is
> > > > this
> > > > fantastic talk by Ash https://www.youtube.com/watch?v=DYC4-xElccE
> > > > which
> > > > still holds and explains it, but I would love to be able to reason and
> > > > discuss more about it - looking at both code and docs - without
> > > > reverse-engineering stuff.
> > > > 
> > > > But I think the goal should be "modularising first" - maybe resulting
> > > > later in easier way of replacing pieces of scheduler, so the
> > > > modularising
> > > > effort should be guided by the current PRs and ways they are trying to
> > > > address starvation for example. Doing it slowly, with mutliple people
> > > > reviewing, learning and contributing (and documentation created on the
> > > > way).
> > > > 
> > > > I think that should be our initial goal ... then maybe things will
> > > > follow.
> > > > 
> > > > J.
> > > > 
> > > > On Tue, Sep 16, 2025 at 8:02 AM asquator [email protected] wrote:
> > > > 
> > > > > will be undocumented*
> > > > > 
> > > > > On Tuesday, September 16th, 2025 at 5:01 PM, asquator
> > > > > [email protected]
> > > > > wrote:
> > > > > 
> > > > > > I see the motivation, but does it have to look so bad?
> > > > > > 
> > > > > > The subclass will look like this:
> > > > > > 
> > > > > > class SchedulerJobRunnerLinearTIScan(SchedulerJobRunner):
> > > > > > def init(
> > > > > > self,
> > > > > > job: Job,
> > > > > > num_runs: int = conf.getint("scheduler", "num_runs"),
> > > > > > scheduler_idle_sleep_time: float = conf.getfloat("scheduler",
> > > > > > "scheduler_idle_sleep_time"),
> > > > > > log: Logger | None = None,
> > > > > > ):
> > > > > > super().init(
> > > > > > job=job,
> > > > > > num_runs=num_runs,
> > > > > > scheduler_idle_sleep_time=scheduler_idle_sleep_time,
> > > > > > log=log,
> > > > > > )
> > > > > > self.task_selector = TASK_SELECTORS[LINEAR_SCAN_SELECTOR]
> > > > > > 
> > > > > > The super class will use the injected hard-coded task selector.
> > > > > > 
> > > > > > Can't we introduce a configuration hierarchy like `core.internal`
> > > > > > and
> > > > > > put there things not exposed to the end user? So we don't have to
> > > > > > do
> > > > > > this
> > > > > > weird subclassing?
> > > > > > 
> > > > > > It will look thus:
> > > > > > 
> > > > > > class SchedulerJobRunner(...):
> > > > > > task_selector_type = conf.get("scheduler.internal",
> > > > > > "task_selector_strategy")
> > > > > > self.task_selector = TASK_SELECTORS[task_selector_type]
> > > > > > 
> > > > > > We'd just like to have an internal toggle as an implementation
> > > > > > detail,
> > > > > > which won't be undocumented and custom implementations won't be
> > > > > > supported.
> > > > > > It's just more convenient and straightforward.
> > > > > > 
> > > > > > Mb there's another way of internal settings management I missed?
> > > > > > 
> > > > > > On Tuesday, September 16th, 2025 at 11:34 AM, Ash Berlin-Taylor
> > > > > > [email protected] wrote:
> > > > > > 
> > > > > > > > On 16 Sep 2025, at 08:58, asquator [email protected] wrote:
> > > > > > > > 
> > > > > > > > Yes, exposing pluggable features means fixing an API, which is
> > > > > > > > confining and just hard to do given the current implementation
> > > > > > > 
> > > > > > > class MyScheduler:
> > > > > > > def execute(self):
> > > > > > > while True:
> > > > > > > # Do what ever you want.
> > > > > > > 
> > > > > > > `airflow scheduler --impl=my.module.MyScheduler`
> > > > > > > 
> > > > > > > That is the API.
> > > > > > > 
> > > > > > > That is as pluggable as we need it to be.
> > > > > > > 
> > > > > > > Everything can be built on top of that, including if you want
> > > > > > > it, a
> > > > > > > pluggable task selection mechanisms.
> > > > > > > 
> > > > > > > Airflow already has too many config options and ways of tuning
> > > > > > > behaviour. We need less of them, not more.
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [Discussion] Make the scheduler's task selection algorithm pluggable

Reply via email to