Re: [Discussion] Make the scheduler's task selection algorithm pluggable

asquator Mon, 15 Sep 2025 20:12:53 -0700

Hello!

First some updates regarding the #54392 PR:
Contributions to the PR have been halted. See the PR itself for more 
information.
A new PR has been opened to address the general problem of starvation using 
stored SQL functions/procedures. Reviews are welcome:
https://github.com/apache/airflow/pull/55537

My position on a pluggable scheduler is that every piece of software, especially 
complex software, *must* be split into smaller, independent components that are 
pluggable, whether internally (bootstrap files) or through configuration. It has 
been said above that the scheduler's code is exceptionally "complex", and I 
completely disagree with that. It's not complex but cumbersome, dirty, 
overloaded and highly monolithic. We have a function called 
`_executable_task_instances_to_queued` with 355 (!) lines and 4 (!) levels of 
nesting. This violates ANY normal clean-code standard, which is kind of... BAD. 
This is what makes the scheduler "complex", difficult to change, and difficult 
for newcomers to step into. This was just one example, but the entire class is 
written like that. Sometimes I have a feeling it has been intentionally 
sabotaged to look this way, and it's sad.

> Roughly speaking the scheduler has three main responsibilities

Exactly! This is a big problem for the SRP. The scheduler should be a facade 
that just triggers different steps, instead of one large incomprehensible 
`while True: do_everything()` script as it looks now. IMO the independent steps 
should even run asynchronously instead of the current sequential execution. That 
would both make the code cleaner and let it run more efficiently. One class should 
not have "three main responsibilities". Never. Over time the industry 
requirements will shift towards running millions and tens of millions of tasks 
daily, and new solutions will be required to support these requirements. The 
way things go today, it will be *very* hard to introduce global changes. The 
scheduler code looks "complex" because it was made so. Inherently the logic is 
very simple: query the tasks, loop over them and log some stuff. We just have 
too much detail in one file, and it's *frustrating*. For the sake of the SRP I 
think we *must* split the scheduler one day, and any friction blocking this 
refactor is another nail in the project's maintainability coffin.

A complete refactor will be a hard thing to do, so incremental changes are much 
more feasible to introduce. Task selection logic is an important part that 
should be taken out to another component. Here we both fix the starvation and 
do a good thing for the project instead of burying it even deeper.  

---

Now that we're done with the clean code topic, let's talk about the maintenance 
overhead so feared by maintainers.
I claim that a plugin architecture *does not* inherently mean more effort spent 
supporting every kind of community implementation.

> There is absolutely no way we can make it available for users to override and 
> use their own implementation - because we will have to support whatever 
> someone implemented.

No. This is absolutely wrong. We won't accept any kind of implementation that 
solves some specific edge case - not at all. The main branch will include just 
one (at most two) generally accepted and tested implementations. If someone 
feels like writing their own version - let them do that in their fork for their 
business needs.  
It should *never* be in main until it's useful for the entire community. If 
someone needs their specific behavior - let them do it, we won't support it as 
it's in their fork. Plugin architecture means *the ability* to quickly change a 
subcomponent to another one, not *the necessity* to support all kinds of 
plugins. We just define a single API and stick to it. We've been researching 
the starvation problem for half a year now and tried all kinds of fixes. Because 
the component is not pluggable, it has been a *real pain* to test anything new.
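
To make "a single API" concrete, here is a minimal sketch of what such a contract could look like. Every name in it (`Candidate`, `TaskSelector`, `select`, `run_selection_step`) is hypothetical and chosen for illustration only - none of this is existing Airflow code or the API proposed in either PR.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, heavily simplified stand-in for a schedulable task instance."""
    task_id: str
    priority_weight: int


class TaskSelector(Protocol):
    """Hypothetical single API that every selection strategy would implement."""

    def select(self, candidates: list[Candidate], max_tis: int) -> list[Candidate]:
        ...


class HighestPriorityFirst:
    """Toy strategy: take the max_tis candidates with the highest priority weight."""

    def select(self, candidates: list[Candidate], max_tis: int) -> list[Candidate]:
        ordered = sorted(candidates, key=lambda c: c.priority_weight, reverse=True)
        return ordered[:max_tis]


def run_selection_step(
    selector: TaskSelector, candidates: list[Candidate], max_tis: int
) -> list[Candidate]:
    """The scheduler side only talks to the interface, never to a concrete strategy."""
    return selector.select(candidates, max_tis)
```

Swapping a strategy then means passing a different object to `run_selection_step`. Anything a fork implements against the same method signature stays the fork's responsibility.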

Let's connect it to our case:  
We have the #54284 PR which is designed to solve a particular issue @dstandish 
described. If this logic solves the problem for them, I have no objection to 
their adoption of this strategy as a custom plugin. I don't see how it can be 
merged into main, because they did a very particular fix that won't work for 
everybody - it will be a burden for the devs, but may be a salvation for their 
team. My position here is to make it easy for them to switch to this strategy 
through a plugin architecture, without us ever taking responsibility for their code. 
My team experienced a similar issue but for pools instead of DAGs. We've been 
considering creating a patch like #54284, but we dug deeper and found the root 
of the problem, so this patch was never created. I agree, we shouldn't pollute 
the repo with small patches - it will be hell.

We also have the #55537 PR which is designed to solve the issue *for everyone*. 
As this implementation aims to replace the current, optimistic scheduler 
(claiming to be "just better"), I think it can certainly coexist with the 
optimistic one for a release or so. The steps are:  
1. Testing and benchmarks outside the main tree (by enthusiasts)  
2. Merging and wide testing by the community, with the ability to switch back 
on failure  
3. Deprecation of the optimistic strategy in case the new strategy is really 
"just better"  

To be honest, I don't care at all if the testing is done out of main (it's 
reasonable), but IMO the second step is still desirable because we cannot 
expect everyone to test their workflows with the new strategy in the fork. It 
implies switching repos, redeploying the chart and doing many unnecessary 
steps. A configuration is much simpler (remember, the new strategy is in main 
only after preliminary testing shows good results). It's just another safety 
step to decrease the chance of breaking people's production workflows, as a 
core component is changed. Regarding subclassing `SchedulerJobRunner` - it's a 
very bad practice. There's absolutely no reason to subclass the entire job 
class to swap one single component. It's just cumbersome and requires splitting 
this poor "god class" into even smaller methods nobody understands. If we decide 
to NOT test the new strategy in main but just replace the current one (I say 
it's less safe, but possible), then at this point it shouldn't bother us at all 
whether it's a subclass or a configuration, as it will be removed anyway.  
We have to focus on finding a good strategy to become the main one, benchmark 
it and understand the implications of switching to it - I hope #55537 may be a 
good candidate.
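
As a rough illustration of step 2 (switching by configuration, with the ability to switch back), the selection could be as small as a lookup with a default. The config key `task_selection_strategy`, the strategy names and both function bodies below are assumptions made up for this sketch. The two bodies are toy stand-ins, not the logic of the current scheduler or of #55537.

```python
from typing import Callable

# (task_ids, limit, is_runnable) -> selected task_ids
Strategy = Callable[[list[str], int, Callable[[str], bool]], list[str]]


def optimistic(tasks: list[str], limit: int, is_runnable: Callable[[str], bool]) -> list[str]:
    """Toy stand-in: fetch the first `limit` rows, then drop the ones that cannot run."""
    return [t for t in tasks[:limit] if is_runnable(t)]


def pessimistic(tasks: list[str], limit: int, is_runnable: Callable[[str], bool]) -> list[str]:
    """Toy stand-in: drop the rows that cannot run first, then apply the limit."""
    return [t for t in tasks if is_runnable(t)][:limit]


# Hypothetical registry: only strategies shipped in main are listed here.
STRATEGIES: dict[str, Strategy] = {"optimistic": optimistic, "pessimistic": pessimistic}
DEFAULT_STRATEGY = "optimistic"


def resolve_strategy(config: dict[str, str]) -> Strategy:
    """Pick the strategy named in the (hypothetical) config, falling back to the default."""
    name = config.get("task_selection_strategy", DEFAULT_STRATEGY)
    if name not in STRATEGIES:
        raise ValueError(f"Unknown task selection strategy: {name!r}")
    return STRATEGIES[name]
```

Switching back on failure is then a one-line configuration change rather than a redeploy from a fork.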

---

Regarding research papers - I don't think it's so hard to find a strategy that 
just works for all cases. From an academic viewpoint, we have a *very simple* 
case of non-preemptible single-trigger scheduling with priorities that can be 
solved with *one sort and a linear scan*. This is basically an entry-level 
LeetCode problem. The main difficulty was to find something that works in our 
case considering:
1. The code is in Python
2. The tasks are in SQL
while still giving the best performance with fewer network hops.
I can say we've made great progress, and I'll give a broader description of the 
new approach we're trying now in a separate mail thread later.
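
For readers who want to see the "one sort and a linear scan" idea spelled out, here is a minimal in-memory sketch. It assumes a simplified model in which each task occupies one slot in a named pool. The `Task` class, the `select_tasks` function and its parameters are hypothetical, and this is plain Python rather than the SQL-side approach of the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical, simplified task: one pool slot per task."""
    task_id: str
    pool: str
    priority_weight: int


def select_tasks(tasks: list[Task], free_slots: dict[str, int], max_tis: int) -> list[Task]:
    """Pick up to max_tis tasks by priority without exceeding any pool's free slots."""
    remaining = dict(free_slots)  # copy so the caller's mapping is not mutated
    selected: list[Task] = []
    # One sort: highest priority first (stable, so ties keep their input order).
    for task in sorted(tasks, key=lambda t: t.priority_weight, reverse=True):
        if len(selected) == max_tis:
            break
        # Linear scan: skip tasks whose pool is full instead of giving up on the batch.
        if remaining.get(task.pool, 0) > 0:
            remaining[task.pool] -= 1
            selected.append(task)
    return selected
```

Because blocked tasks are skipped rather than counted against the batch, a lower-priority task in a free pool is still picked when higher-priority tasks are stuck behind a full pool.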

---

TL;DR:  
A separation of concerns is highly desired for the scheduler and we should make 
it BETTER, not WORSE.  
Pluggability is a good thing so everyone can inject things of their own.  
We won't support all kinds of community scheduling strategies in the main tree. 
To be clear: we won't support any, except the one that works well in all cases.  
If we test outside of the main repo, we shouldn't care how the strategy is 
selected, but inheritance is a messy approach and a pretty bad pattern here.  
Let's focus on solving starvation, and just do the coding right, adhering to 
SRP and minimizing the maintenance burden. 

On Monday, September 15th, 2025 at 10:23 PM, Natanel <[email protected]> 
wrote:

> Hello.
> 
> Me and Asquator have already been through this issue, and we have, what we
> think, is a decent implementation of pluggable task selection algorithm for
> airflow.
> (which we have implemented here
> https://github.com/Asquator/airflow/tree/feature/pessimistic-task-fetching-with-window-function
> )
> 
> I agree that no perfect solution will ever exist in airflow for all use
> cases, regarding task selection, which is why this is probably a necessity
> more than a Nice To Have feature.
> 
> In the current way we implemented it, we can have a few pre implemented
> algorithms, that solve different issues, as not all users will encounter
> all issues, and by making them pluggable correctly, with a configuration,
> we can include the documentation on when to use a specific task selection
> algorithm, just like Jarek Potiuk proposed. it will not be customizable,
> but rather injectable inside of the airflow-core package.
> 
> Of course there are risks that come along with it, like users abusing it
> and trying to create a new task selection algorithm for each edge case or
> use case they have, which can become hard to maintain and follow, however,
> I do not agree that it makes it harder to maintain (in terms of code
> amount), or easier to make mistakes, though, if implemented correctly, each
> task selector is independent, and acts as a black box, has a simple api,
> and can be interchanged without any code changes, which makes it, in my
> opinion, easier to maintain existing algorithms, and removes the need to
> change a single big and sloppy file (as it is now).
> In fact, I am certain that making it pluggable will simplify the scheduler
> altogether as now, different parts will be clearly separated in different
> files and directories.
> 
> Allowing the injectable algorithms, does give more flexibility, and can
> even make adding the new priority weights algorithm quite simple, and not
> cause any massive changes.
> 
> The main downside is that we have to choose an api very carefully, as when
> we add it, it will be exceptionally hard to change it, as it would mean
> changing it in multiple places, and so it would be considered a breaking
> change.
> 
> 
> On Mon, 1 Sept 2025 at 18:36, Christos Bisias [email protected] wrote:
> 
> > Hello,
> > 
> > A while back I started a discussion on the mailing list regarding making
> > some changes to the task selection query in order to improve the
> > scheduler's throughput.
> > 
> > https://github.com/apache/airflow/pull/54103
> > 
> > Another topic came up during that discussion related to task starvation due
> > to the current selection algorithm. There are two open PRs with different
> > fixes for that issue.
> > 
> > https://github.com/apache/airflow/pull/54284
> > 
> > https://github.com/apache/airflow/pull/53492
> > 
> > Everyone has his own needs and it's probable that a good number of users
> > won't experience the starvation issue.
> > 
> > Each approach has its own advantages and disadvantages and for that reason
> > it doesn't feel like there is a right or wrong approach here or a single
> > solution for all.
> > 
> > There have been papers on task selection algorithms like this one
> > 
> > https://ieeexplore.ieee.org/document/9799199
> > 
> > I would like to suggest refactoring the scheduler so that the task
> > selection algorithm can be pluggable. The current implementation will be
> > the default. Everyone will be able to configure the path to his own class.
> > That will be the most beneficial to the majority of users.
> > 
> > In the future, anyone could create a PR with his implementation and if
> > enough people like it, it could be added to the repo.
> > 
> > This has already been done for the priority weights algorithm, so why not
> > in this case as well?
> > 
> > https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
> > 
> > If there is positive feedback on this idea, I would like to implement it.
> > 
> > Please let me know what you think. Thank you!
> > 
> > Regards,
> > Christos

