Re: [Discussion] Make the scheduler's task selection algorithm pluggable

asquator Wed, 17 Sep 2025 09:05:10 -0700

Yes, I see all this critique to come, and it's reasonable.
Can we think of "how good it has to be for it to be considered an option at 
all"?


Testability, SQL coupling and multiple implementations are the challenges we're 
facing now, and trying to address.

Actually tried to delay this idea to the very end, but in terms of efficiency, 
determinism and the "end experience" we haven't yet seen a better solution. I 
say we have to benchmark first and then weigh all the arguments for and against.

As we long as we're trying *something* and getting better, we're in good.
Let's try things.

Sent with Proton Mail secure email.

On Wednesday, September 17th, 2025 at 4:04 AM, Jarek Potiuk <ja...@potiuk.com> 
wrote:

> Side comment:
>
> > Pushing the logic to sprocs is a brave venture indeed.
>
>
> Oh My God. Stored procedures... Been there done that...
> Never, ever want to go back to that nightmare of debugging and issue
> solving... Days lost on trying to fix even the simplest issues .... It will
> be REALLY TOUGH SELL ....
>
> Speaking of best practices: coupling logic and database is as bad
> pattern as I can think of. We are even in the world of decoupling storage
> and computing for databases, and coming back to the world where part of the
> logic is in app and part in the DB does not seem good at all.
>
> I think any attempt we do to simplify things here should be easily testable
> and debuggable.. Not so with stored procedures. That's a huge disadvantage
> of stored procedures as far as I am concerned.
>
> J.
>
>
> On Tue, Sep 16, 2025 at 1:03 PM asquator asqua...@proton.me wrote:
>
> > Re Daniel:
> > Pushing the logic to sprocs is a brave venture indeed. The idea came to me
> > as a last resort after a long frustration with different techniques. We do
> > it because we've seemingly tried everything else, and failed miserably.
> > Ofc there is a big maintenance question, and I'll reserve my detailed
> > comment before I do more improvements. It IS a tougher sell, but compared
> > to our previous attempt with WFs I see three improvements here:
> > 1. Maintainability (adding new limits)
> > 2. Performance
> > 3. Solved cases
> > I will notify on the main mailing list devoted to starvation when it's
> > solid and ready for testing. The PR is already open for reviews.
> >
> > Re Jarek:
> > Thanks for the direction, we'll do the best to follow the path of
> > incremental changes and hopefully arrive at a neat solution in the end.
> >
> > On Tuesday, September 16th, 2025 at 8:42 PM, Daniel Standish via dev <
> > dev@airflow.apache.org> wrote:
> >
> > > I agree with Jarek's comments.
> > >
> > > Re some of your comments...
> > >
> > > Scheduler can be improved. Scheduler should be improved. Incremental is
> > > good.
> > >
> > > Nobody is going to dispute that breaking things up into components that
> > > are
> > > easier to understand and maintain is a good goal.
> > >
> > > Nobody is going to object to running more stuff in parallel, if it is
> > > feasible and produces better performance, is a good thing.
> > >
> > > Re your specific PR
> > >
> > > I think that your idea to push the logic down to stored procs though is a
> > > tougher sell. For one, you need to add a different sproc per dialect. And
> > > the length of the one sproc you wrote is about the same as the long
> > > function that you are trying to replace. SQL is also a worse language for
> > > intelligibility and modularization so it goes in the opposite direction.
> > > Your sproc is tougher to inspect than the function it replaces IMHO.
> > >
> > > We obviously cannot write off the idea of using sprocs and pushing more
> > > logic to sql. But there are real tradeoffs and the case has to be made
> > > that it's worth it. The sproc approach does appear to be worse to
> > > maintain, tougher to test, and more error prone. And to make changes
> > > requires a deployment --- not just patching code. So, while a case could
> > > be made that it is worth it, the case has to be made.
> > >
> > > On Tue, Sep 16, 2025 at 10:33 AM Jarek Potiuk ja...@potiuk.com wrote:
> > >
> > > > It's just the very basic thing here is 'community over code' - wy
> > > > believe
> > > > that good code is result of building great community that works
> > > > together.
> > > > And that the opposite is not necessarily true.
> > > >
> > > > In this case I identify the issue not as 'code is bad' but as `let's
> > > > build
> > > > community of people collaborating on making it better' - where careful
> > > > refactoring with participation od multiple parties and especially
> > > > building
> > > > on all the experience and accumulated years of wisdom of people who
> > > > worked
> > > > on it is crucial. Because individuals who have even best intentions to
> > > > make
> > > > things 'better' might just miss a lot of context (like for example the
> > > > open-source context where we have to be careful what we expose to
> > > > users -
> > > > lessons we learned hard over the many years Airflow is maintained by
> > > > us.
> > > >
> > > > And it is simply rather closing for cooperation rather than opening if
> > > > we
> > > > start with goal that is too ambitious and starts with "let's rip it
> > > > all out
> > > > it's badly designed" vibe to it.
> > > >
> > > > I think it's better to see the modularity / customisability as
> > > > potentially
> > > > good end goal and start with incremental improvements that will also
> > > > address the imperfect explanation of context and decisions and get
> > > > people
> > > > enthused and collaborating on it.
> > > >
> > > > J.
> > > >
> > > > wt., 16 wrz 2025, 11:13 użytkownik asquator asqua...@proton.me
> > > > napisał:
> > > >
> > > > > Well, it's always new contributors who criticize the old code,
> > > > > because
> > > > > they're affected the most.
> > > > > It's not about blaming people, but noticing things "at the margin",
> > > > > ignoring the history and proposing
> > > > > changes that are objectively beneficial at the moment. I did watch
> > > > > Ash's
> > > > > talk before starting to work
> > > > > on the starvation issue, and it's great. I have nothing to say about
> > > > > the
> > > > > algorithm and the logic
> > > > > (mb except it should be asynchronous), but the code structure. Over
> > > > > the
> > > > > years, small patches have been added
> > > > > to one file, and now it has exploded. It's normal. It happens. We
> > > > > just
> > > > > have to realize that and start
> > > > > splitting the code, without drastically changing the logic.
> > > > >
> > > > > Practice shows that modular code is always better than monolithic
> > > > > scripts
> > > > > with multiple responsibilities.
> > > > > I don't say abstract every small part of the code. The "cheapest"
> > > > > way to
> > > > > decide which parts should
> > > > > be abstracted is the simple principle of "if the component has to be
> > > > > replaced once, it can be replaced again in the future".
> > > > > As we're working on the critical section, we're ready to do the extra
> > > > > work
> > > > > to abstract it out and make it
> > > > > pluggable (for developers) - it can be the first step in
> > > > > modularizing the
> > > > > scheduler which we're ready to take.
> > > > >
> > > > > As I have to implement the PR, I'm willing to dive into the tactical
> > > > > level
> > > > > and reach a compromise with the PMCs
> > > > > (who know this project's internals much better) regarding what's the
> > > > > best
> > > > > option to inject things internally.
> > > > > It's a purely technical problem I have to solve to have a working
> > > > > PR. I
> > > > > will do the subclass for testing because
> > > > > it's quick and dirty, but it would be nice to come to a better
> > > > > solution
> > > > > (it shouldn't be too hard) so we can
> > > > > do it the best way before it's (hopefully) merged.
> > > > >
> > > > > On Tuesday, September 16th, 2025 at 7:08 PM, Jarek Potiuk <
> > > > > ja...@potiuk.com> wrote:
> > > > >
> > > > > > I think this is something we should seriously discuss after 3.1 as
> > > > > > many
> > > > > > people are busy now with tying up loose ends.
> > > > > >
> > > > > > But I would be generally in favour of looking at Scheduler logic,
> > > > > > modularising it and making it easier to reason about. I do not
> > > > > > want to
> > > > > > (absolutely) say that it's bad or "should have been done better" or
> > > > > > anything to say always modular code is best (which often might be
> > > > > > implied
> > > > > > as "someone did a bad job here"). This is absolutely not that. It's
> > > > > > very
> > > > > > easy to criticise things (even subconsciously) when you come out
> > > > > > from
> > > > > > outside and you see how things "should have been" - but without
> > > > > > all the
> > > > > > context and history, this is often a bad messaging to those who
> > > > > > spent
> > > > > > years
> > > > > > on making thing work reliably in production for tens of thousands
> > > > > > of
> > > > > > users
> > > > > > and being generally stable and one of the most reliable part of
> > > > > > Airflow
> > > > > > "core".
> > > > > >
> > > > > > There are PLENTY of reasons for scheduler being implemented the
> > > > > > way it
> > > > > > is -
> > > > > > and even trying to approach explaining the history and decision
> > > > > > making
> > > > > > process would take a lot of time. And currently (for good reasons)
> > > > > > the
> > > > > > scheduler "API" is exactly what Ash explained.
> > > > > >
> > > > > > But there is no reason we should not think and make a concerted
> > > > > > (and
> > > > > > group)
> > > > > > effort to modularise it and make scheduler easier to reason about
> > > > > > and -
> > > > > > importantly - being easier for more people to contribute to and
> > > > > > discuss
> > > > > > it
> > > > > > - and most importantly - adding way more modularised tests that
> > > > > > would
> > > > > > also
> > > > > > allow us to break it up and tests parts of it - also for
> > > > > > performance
> > > > > > and
> > > > > > behaviour characteristics. It's not well suited for it today, but -
> > > > > > possibly - it could. And as the most important part of Airflow, we
> > > > > > should make it easier to understand, reason about and contribute
> > > > > > to by
> > > > > > many
> > > > > > people. Simply now there are probably just a few people (Ash being
> > > > > > the
> > > > > > main
> > > > > > person) that can reason and discuss some intrinsic scheduler
> > > > > > insights.
> > > > > > And
> > > > > > If we want to make Airflow sustainable, we should make it easier
> > > > > > for
> > > > > > others
> > > > > > to understand and contribute to it.
> > > > > >
> > > > > > One of the things as a result of it - I would love to have it
> > > > > > better
> > > > > > documented and explained the reasoning behind some decisions and
> > > > > > explaining
> > > > > > how it works (that might be a result of such a concerted effort).
> > > > > >
> > > > > > It's a similar story as with Ci/CD Breeze two years ago - I was the
> > > > > > only
> > > > > > one who really could reason about it but through rewriting in
> > > > > > Python
> > > > > > and documenting with ADRs
> > > > > > https://github.com/apache/airflow/tree/main/dev/breeze/doc/adr
> > > > > > which
> > > > > > still
> > > > > > describe some basic assumptions there and engaging others,
> > > > > > modularising
> > > > > > stuff and getting them to participate I can go now for 3 weeks
> > > > > > vacations
> > > > > > knowing that things will be taken care of, no matter what (Which
> > > > > > BTW. I
> > > > > > am
> > > > > > doing now).
> > > > > >
> > > > > > The "why" and "how" scheduler works is not really documented.
> > > > > > There is
> > > > > > this
> > > > > > fantastic talk by Ash https://www.youtube.com/watch?v=DYC4-xElccE
> > > > > > which
> > > > > > still holds and explains it, but I would love to be able to reason
> > > > > > and
> > > > > > discuss more about it - looking at both code and docs - without
> > > > > > reverse-engineering stuff.
> > > > > >
> > > > > > But I think the goal should be "modularising first" - maybe
> > > > > > resulting
> > > > > > later in easier way of replacing pieces of scheduler, so the
> > > > > > modularising
> > > > > > effort should be guided by the current PRs and ways they are
> > > > > > trying to
> > > > > > address starvation for example. Doing it slowly, with mutliple
> > > > > > people
> > > > > > reviewing, learning and contributing (and documentation created on
> > > > > > the
> > > > > > way).
> > > > > >
> > > > > > I think that should be our initial goal ... then maybe things will
> > > > > > follow.
> > > > > >
> > > > > > J.
> > > > > >
> > > > > > On Tue, Sep 16, 2025 at 8:02 AM asquator asqua...@proton.me wrote:
> > > > > >
> > > > > > > will be undocumented*
> > > > > > >
> > > > > > > On Tuesday, September 16th, 2025 at 5:01 PM, asquator
> > > > > > > asqua...@proton.me
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I see the motivation, but does it have to look so bad?
> > > > > > > >
> > > > > > > > The subclass will look like this:
> > > > > > > >
> > > > > > > > class SchedulerJobRunnerLinearTIScan(SchedulerJobRunner):
> > > > > > > > def init(
> > > > > > > > self,
> > > > > > > > job: Job,
> > > > > > > > num_runs: int = conf.getint("scheduler", "num_runs"),
> > > > > > > > scheduler_idle_sleep_time: float = conf.getfloat("scheduler",
> > > > > > > > "scheduler_idle_sleep_time"),
> > > > > > > > log: Logger | None = None,
> > > > > > > > ):
> > > > > > > > super().init(
> > > > > > > > job=job,
> > > > > > > > num_runs=num_runs,
> > > > > > > > scheduler_idle_sleep_time=scheduler_idle_sleep_time,
> > > > > > > > log=log,
> > > > > > > > )
> > > > > > > > self.task_selector = TASK_SELECTORS[LINEAR_SCAN_SELECTOR]
> > > > > > > >
> > > > > > > > The super class will use the injected hard-coded task selector.
> > > > > > > >
> > > > > > > > Can't we introduce a configuration hierarchy like
> > > > > > > > `core.internal`
> > > > > > > > and
> > > > > > > > put there things not exposed to the end user? So we don't have
> > > > > > > > to
> > > > > > > > do
> > > > > > > > this
> > > > > > > > weird subclassing?
> > > > > > > >
> > > > > > > > It will look thus:
> > > > > > > >
> > > > > > > > class SchedulerJobRunner(...):
> > > > > > > > task_selector_type = conf.get("scheduler.internal",
> > > > > > > > "task_selector_strategy")
> > > > > > > > self.task_selector = TASK_SELECTORS[task_selector_type]
> > > > > > > >
> > > > > > > > We'd just like to have an internal toggle as an implementation
> > > > > > > > detail,
> > > > > > > > which won't be undocumented and custom implementations won't be
> > > > > > > > supported.
> > > > > > > > It's just more convenient and straightforward.
> > > > > > > >
> > > > > > > > Mb there's another way of internal settings management I
> > > > > > > > missed?
> > > > > > > >
> > > > > > > > On Tuesday, September 16th, 2025 at 11:34 AM, Ash Berlin-Taylor
> > > > > > > > a...@apache.org wrote:
> > > > > > > >
> > > > > > > > > > On 16 Sep 2025, at 08:58, asquator asqua...@proton.me
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Yes, exposing pluggable features means fixing an API,
> > > > > > > > > > which is
> > > > > > > > > > confining and just hard to do given the current
> > > > > > > > > > implementation
> > > > > > > > >
> > > > > > > > > class MyScheduler:
> > > > > > > > > def execute(self):
> > > > > > > > > while True:
> > > > > > > > > # Do what ever you want.
> > > > > > > > >
> > > > > > > > > `airflow scheduler --impl=my.module.MyScheduler`
> > > > > > > > >
> > > > > > > > > That is the API.
> > > > > > > > >
> > > > > > > > > That is as pluggable as we need it to be.
> > > > > > > > >
> > > > > > > > > Everything can be built on top of that, including if you want
> > > > > > > > > it, a
> > > > > > > > > pluggable task selection mechanisms.
> > > > > > > > >
> > > > > > > > > Airflow already has too many config options and ways of
> > > > > > > > > tuning
> > > > > > > > > behaviour. We need less of them, not more.
> >
> > ---------------------------------------------------------------------
> >
> > > > > > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > > > > > For additional commands, e-mail: dev-h...@airflow.apache.org
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > > > For additional commands, e-mail: dev-h...@airflow.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > For additional commands, e-mail: dev-h...@airflow.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org

Re: [Discussion] Make the scheduler's task selection algorithm pluggable

Reply via email to