>
> IMO, the approach in the patch isn't easily maintainable. Most of the
> calculations are performed by SQL in a huge query.
> It would be my preference to have many smaller queries and do part of the
> calculations in python. This will be easier to understand, maintain and
> debug in the future. Also, it will be easier to unit test.


You're talking about https://github.com/apache/airflow/pull/53492/ right?
I agree.  I share the skepticism that it must be one big ugly query.  At a
minimum it needs a lot more work and refinement.  Not something that should
be merged in the current state, even as experimental.

Where is the PR from @Christos?




On Wed, Aug 6, 2025 at 12:54 PM Jens Scheffler <j_scheff...@gmx.de.invalid>
wrote:

> Hi,
>
> I was (until now) not able to re-read all the Slack discussion and would
> like to do that by the weekend at the latest. Like Jarek, I also fear that
> the optimization makes the Scheduler rather hard to maintain. We also had
> some points where we _thought_ we could contribute some optimizations,
> especially for Mapped Tasks, and then considered the complexity of Mapped
> Task Groups, where the Depth-First Strategy would defeat all our drafted
> optimizations. So in our current approach we are also cutting the Dags
> down into manageable pieces.
>
> So far (I believe, but anybody correct me if I am wrong) the scaling was
> always documented only with options, no real upper boundary (other than
> soft limits) existing in the code. So the delivered product never
> confirmed fixed upper limits. It might be good also to consider
> documenting where we know there are natural or structural boundaries. I
> hope I can read more details in the next days.
>
> Jens
>
> On 06.08.25 10:31, Jarek Potiuk wrote:
> >> My main issue, and the topic of this thread, has been that the scheduler
> >> does unnecessary work that leads to decreased throughput. My solution has
> >> been to limit the results of the query to the dag's cap on active tasks
> >> that the user has defined.
> >
> > Yes. I understand that. There are situations that cause this "unnecessary
> > work" to be excessive and lead to lower performance and more memory usage.
> > This is quite "normal". No system in the world is optimized for all kinds
> > of scenarios and sometimes you need to make trade-offs - for example
> > between performance and maintainability (and support for MySQL and
> > Postgres, as Ash pointed out in some other threads). There are various
> > optimisation goals we can chase: optimal performance and no wasted
> > resources in certain situations and configurations is one of (many) goals
> > we might have. Other goals might include: easier maintainability, better
> > community collaboration, simplicity, less code to maintain, testability,
> > and also (what I mentioned before) sometimes not handling certain
> > scenarios and introducing friction **might** be a deliberate decision we
> > take in order to push our users in the direction we want them to go. Yes.
> > As community and maintainers we do not have to always "follow" our users'
> > behaviour - we can (and we often do) educate our users and show them
> > better ways of doing things.
> >
> > For example we had a LONG discussion about whether to introduce caching of
> > Variable values during Dag parsing - because we knew our users often use
> > Variables in top-level code of their Dags and this leads to a lot of waste
> > and high CPU and I/O usage by the Dag processor. We finally implemented it
> > as an experimental feature, but it was not at all certain we would - we
> > had to carefully consider what we were trading in exchange for that
> > performance - and whether it's worth it.
> >
> > Same here - I understand: there are some cases (arguably rather niche -
> > with very large Dags) where the scheduler does unnecessary processing and
> > performance could be improved. Now - we need to understand what trade-offs
> > we need to make as maintainers and community (including our users) if we
> > want to address it. We need to know what complexity is involved, whether it
> > will work with Postgres/MySQL and SQLite, whether we will be able to
> > continue debugging and testing it. And whether we want to drive our users
> > away from the modularisation strategy (smaller Dags) that we think makes
> > more sense than bigger Dags. We have to think about what happens next. If
> > we make "huge Dags" first-class citizens, will it mean that we will have to
> > redesign our UI to support them? What should we do when someone opens an
> > issue "I have this 1000000 task Dag and I cannot open the Airflow UI - it
> > crashes hard and makes my Airflow instance unusable - please fix it ASAP"?
> > I certainly would like to avoid such a situation stressing our fellow
> > maintainers who work on the UI - so they should also have a say on how
> > feasible it is to make "huge Dags" "easy" for them.
> >
> > All those factors should be taken into account when you make a "product"
> > decision. Performance gains for particular cases are just one of many
> > factors to consider - and often not the most important one.
> >
> > J.
> >
> >
> > On Wed, Aug 6, 2025 at 7:34 AM Christos Bisias <christos...@gmail.com>
> > wrote:
> >
> >> We also have a dag with dynamic task mapping that can grow immensely.
> >>
> >> I've been looking at https://github.com/apache/airflow/pull/53492.
> >>
> >> My main issue, and the topic of this thread, has been that the scheduler
> >> does unnecessary work that leads to decreased throughput. My solution has
> >> been to limit the results of the query to the dag's cap on active tasks
> >> that the user has defined.
> >>
> >> The patch is more focused on the available pool slots. I get the idea
> that
> >> if we can only examine and queue as many tasks as available slots, then
> we
> >> will be efficiently utilizing the available slots to the max, the
> >> throughput will increase and my issue will be solved as well.
> >>
> >> IMO, the approach in the patch isn't easily maintainable. Most of the
> >> calculations are performed by SQL in a huge query.
> >>
> >> It would be my preference to have many smaller queries and do part of
> the
> >> calculations in python. This will be easier to understand, maintain and
> >> debug in the future. Also, it will be easier to unit test.
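> >>
> >> To illustrate the shape I mean, here is a rough sketch (not a working
> >> patch; it reuses Airflow's ORM models, but the helper itself is
> >> hypothetical): two small queries plus a plain-python step computing the
> >> remaining per-dag capacity, which is easy to unit test on its own.
> >>
> >> from sqlalchemy import func, select
> >>
> >> from airflow.models.dag import DagModel
> >> from airflow.models.taskinstance import TaskInstance
> >> from airflow.utils.state import TaskInstanceState
> >>
> >> def remaining_capacity_per_dag(session):
> >>     # Small query 1: TIs already active (running or queued), grouped by dag.
> >>     active_counts = dict(
> >>         session.execute(
> >>             select(TaskInstance.dag_id, func.count())
> >>             .where(TaskInstance.state.in_(
> >>                 [TaskInstanceState.RUNNING, TaskInstanceState.QUEUED]))
> >>             .group_by(TaskInstance.dag_id)
> >>         ).all()
> >>     )
> >>     # Small query 2: the per-dag cap the user configured.
> >>     caps = dict(
> >>         session.execute(select(DagModel.dag_id, DagModel.max_active_tasks)).all()
> >>     )
> >>     # Plain python: remaining capacity per dag, trivially unit-testable.
> >>     return {
> >>         dag_id: max(cap - active_counts.get(dag_id, 0), 0)
> >>         for dag_id, cap in caps.items()
> >>     }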
> >>
> >> On Tue, Aug 5, 2025 at 10:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >>> Just a comment here - I am also not opposed if optimizations will
> >>> be implemented without impacting the more "regular" cases. And -
> >> important -
> >>> without adding huge complexity.
> >>>
> >>> The SQL queries I saw in recent PRs and discussions look both "smart"
> and
> >>> "scary" at the same time. Optimizations like that tend to lead to
> >>> obfuscated, difficult to understand and reason about code, and "smart"
> >> solutions -
> >>> sometimes "too smart". And when it ends up with one or two people only
> >>> being able to debug and fix problems connected with those, things
> become
> >> a
> >>> little hairy. So whatever we do there, it **must** be not only "smart"
> >> but
> >>> also easy to read and well tested - so that anyone can run the tests
> >> easily
> >>> and reproduce potential failure cases.
> >>>
> >>> And yes I know I am writing this as someone who - for years was the
> only
> >>> one to understand our complex CI setup. But I think over the last two
> >> years
> >>> we are definitely going into, simpler, easier to understand setup and
> we
> >>> have more people on board who know how to deal with it and I think that
> >> is
> >>> a very good direction we are taking :). And I am sure that when I go
> for
> >> my
> >>> planned 3 weeks holidays before the summit, everything will work as
> >>> smoothly as when I am here - at least.
> >>>
> >>> Also I think there is quite a difference (when it comes to scheduling)
> >> when
> >>> you have mapped tasks versus "regular tasks". I think Airflow even
> >>> currently behaves rather differently in those two different cases, and
> >> also
> >>> it has a well-thought-out and optimized UI experience to handle thousands
> of
> >>> them. Also the work of David Blain on Lazy Expandable Task Mapping will
> >>> push the boundaries of what is possible there as well:
> >>> https://github.com/apache/airflow/pull/51391. Even if we solve
> >> scheduling
> >>> optimization - the UI and ability to monitor such huge Dags is still
> >> likely
> >>> not something our UI was designed for.
> >>>
> >>> And I am fully on board with "splitting to even smaller pieces" and
> >>> "modularizing" things - and "modularizing and splitting big Dags into
> >>> smaller Dags" feels like precisely what should be done. And I think it
> >>> would be a nice idea to try it and see if you can't achieve
> >> the
> >>> same results without adding complexity.
> >>>
> >>> J.
> >>>
> >>>
> >>> On Tue, Aug 5, 2025 at 8:47 PM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >>>
> >>>> Yeah dynamic task mapping is a good case where you could easily end up
> >>>> with thousands of tasks in a dag.
> >>>>
> >>>> As I like to say, Airflow is a broad church and if we can reasonably
> >>>> support diverse workloads without impacting others (either the
> >>>> workloads or our ability to support and maintain them, etc.) then I'm
> >>>> all for it.
> >>>>
> >>>> In addition to your two items I’d like to add
> >>>>
> >>>> 3. That it doesn’t increase the db’s CPU disproportionally to the
> >>>> increased task throughput
> >>>>
> >>>>> On 5 Aug 2025, at 19:14, asquator <asqua...@proton.me.invalid>
> >> wrote:
> >>>>> I'm glad this issue finally got enough attention and we can move it
> >>>> forward.
> >>>>> I took a look at @Christos's patch and it makes sense overall; it's
> >>>>> fine for the specific problem they experienced with the
> >>>>> max_active_tasks limit.
> >>>>> For those unfamiliar with the core problem, the bug has plenty of
> >>>>> variations where starvation happens due to different concurrency
> >>>>> limitations being nearly saturated, which creates the opportunity for
> >>>>> the scheduler to pull many tasks and schedule none of them.
> >>>>> To reproduce this bug, you need two conditions:
> >>>>> 1. Many tasks (>> max_tis) belonging to one "pool", where "pool" is
> >>> some
> >>>> concurrency limitation of Airflow. Note that originally the bug was
> >>>> discovered in context of task pools (see
> >>>> https://github.com/apache/airflow/issues/45636).
> >>>>> 2. The tasks are short enough (or the parallelism is large enough)
> >> for
> >>>> the tasks from the nearly starved pool to free some slots in every
> >>>> scheduler's iteration.
> >>>>> When we discovered a bug that starved our less prioritized pool, even
> >>>>> when the most prioritized pool was almost full (thanks to @nevcohen),
> >>>>> we wanted to implement a patch similar to what @Christos suggested
> >>>>> above, but for pools. But then we realized this issue can arise due to
> >>>>> limits other than task pools (a sketch of where these knobs are set
> >>>>> follows the list), including:
> >>>>> max_active_tasks
> >>>>> max_active_tis_per_dag
> >>>>> max_active_tis_per_dagrun
> >>>>>
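> >>>>> For readers less familiar with these knobs, a minimal sketch of where
> >>>>> they (and the pool) are set in a DAG file - illustrative names and
> >>>>> values only:
> >>>>>
> >>>>> import pendulum
> >>>>> from airflow.decorators import dag, task
> >>>>>
> >>>>> @dag(
> >>>>>     dag_id="starvation_example",
> >>>>>     start_date=pendulum.datetime(2025, 1, 1),
> >>>>>     schedule=None,
> >>>>>     max_active_tasks=16,              # cap on concurrent TIs for the whole dag
> >>>>> )
> >>>>> def starvation_example():
> >>>>>     @task(
> >>>>>         pool="shared_pool",           # pool-level limit
> >>>>>         max_active_tis_per_dag=8,     # cap across all runs of this dag
> >>>>>         max_active_tis_per_dagrun=4,  # cap within a single dag run
> >>>>>     )
> >>>>>     def process(i: int) -> int:
> >>>>>         return i
> >>>>>
> >>>>>     process.expand(i=list(range(1000)))
> >>>>>
> >>>>> starvation_example()
> >>>>>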
> >>>>> So we were able to predict the forthcoming bug reports for different
> >>>>> kinds of starvation, and we started working on the most general
> >>>>> solution, which is the topic of this discussion.
> >>>>> I want to also answer @potiuk regarding "why you need such large
> >> DAGs",
> >>>> but I will be brief.
> >>>>> Airflow is an advanced tool for scheduling large data operations, and
> >>>> over the years it has pushed to production many features that lead to
> >>>> organizations writing DAGs that contain thousands of tasks. The most
> >>>> prominent one is dynamic task mapping. This feature made us realize
> >>>> we can
> >>> implement
> >>>> a batching work queue pattern and create a task for every unit we have
> >> to
> >>>> process, say it's a file in a specific folder, a path in the
> >> filesystem,
> >>> a
> >>>> pointer to some data stored in object storage, etc. We like to think
> in
> >>>> terms of splitting the work into many tasks. Is it good? I don't know,
> >>> but
> >>>> Airflow has already stepped onto this path, and we have to make it
> >>>> technologically possible (if we can).
> >>>>> Nevertheless, even if such DAGs are considered too big and splitting
> >>>>> them is a good idea (though that still does not help with mapped tasks
> >>>>> - we sometimes create tens of thousands of them and expect them to be
> >>>>> processed in parallel), this issue does not only address the described
> >>>>> case, but many others, including prioritized pools, mapped tasks or
> >>>>> max_active_runs starvation on large backfills.
> >>>>> The only part that's missing now is measuring query time (static
> >>>> benchmarks) and measuring overall scheduling metrics in production
> >>>> workloads (dynamic benchmarks).
> >>>>> We're working hard on this crucial part now.
> >>>>>
> >>>>> We'd be happy to have any assistance from the community with regard to
> >>>>> the
> >>>> dynamic benchmarks, because every workload is different and it's
> pretty
> >>>> difficult to simulate the general case in such a hard-to-reproduce
> >> issue.
> >>>> We have to make sure that:
> >>>>> 1. In a busy workload, the new logic boosts the scheduler's
> >> throughput.
> >>>>> 2. In a light workload, the nested windowing doesn't significantly
> >> slow
> >>>> down the computation.
> >>>>>
> >>>>>> On Monday, August 4th, 2025 at 9:00 PM, Christos Bisias <
> >>>> christos...@gmail.com> wrote:
> >>>>>> I created a draft PR for anyone interested to take a look at the
> >> code
> >>>>>> https://github.com/apache/airflow/pull/54103
> >>>>>>
> >>>>>> I was able to demonstrate the issue in the unit test with far fewer
> >>>>>> tasks. All we need is for the tasks brought back by the db query to
> >>>>>> belong to the same dag_run or dag. This can happen when the first
> >>>>>> SCHEDULED tasks in line to be examined are at least as many as the
> >>>>>> number of tis per query.
> >>>>>>
> >>>>>> On Mon, Aug 4, 2025 at 8:37 PM Daniel Standish
> >>>>>> daniel.stand...@astronomer.io.invalid wrote:
> >>>>>>
> >>>>>>>> The configurability was my recommendation for
> >>>>>>>> https://github.com/apache/airflow/pull/53492
> >>>>>>>> Given the fact that this change is at the heart of Airflow I think
> >>> the
> >>>>>>>> changes should be experimental where users can switch between
> >>>> different
> >>>>>>>> strategies/modes of the scheduler.
> >>>>>>>> If and when we have enough data to support that a specific option is
> >>>>>>>> always better, we can make decisions accordingly.
> >>>>>>> Yeah I guess looking at #53492
> >>>>>>> https://github.com/apache/airflow/pull/53492 it does seem too
> >> risky
> >>> to
> >>>>>>> just change the behavior in airflow without releasing it first as
> >>>>>>> experimental.
> >>>>>>>
> >>>>>>> I doubt we can get sufficient real world testing without doing
> >> that.
> >>>>>>> So if this is introduced, I think it should just be introduced as
> >>>>>>> an experimental optimization. And the intention would be that
> >> ultimately
> >>>>>>> there will only be one scheduling mode, and this is just a way to
> >>> test
> >>>> this
> >>>>>>> out more widely. Not that we are intending to have two scheduling
> >>> code
> >>>>>>> paths on a permanent basis.
> >>>>>>>
> >>>>>>> WDYT
> >>>>>>>
> >>>>>>> On Mon, Aug 4, 2025 at 12:50 AM Christos Bisias
> >>> christos...@gmail.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>> So my question to you is: is it impossible, or just demanding or
> >>>>>>>>> difficult
> >>>>>>>>> to split your Dags into smaller dags connected with asset aware
> >>>>>>>>> scheduling?
> >>>>>>>> Jarek, I'm going to discuss this with the team and I will get you
> >> an
> >>>>>>>> answer
> >>>>>>>> on that.
> >>>>>>>>
> >>>>>>>> I've shared this again on the thread
> >>>>>>>
> >>
> https://github.com/xBis7/airflow/compare/69ab304ffa3d9b847b7dd0ee90ee6ef100223d66..scheduler-perf-patch
> >>>>>>>> I haven't created a PR because this is just a POC and it's also
> >>>> setting a
> >>>>>>>> limit per dag. I would like to get feedback on whether it's better
> >>> to
> >>>>>>>> make
> >>>>>>>> it per dag or per dag_run.
> >>>>>>>> I can create a draft PR if that's helpful and makes it easier to
> >> add
> >>>>>>>> comments.
> >>>>>>>>
> >>>>>>>> Let me try to explain the issue better. From a high level
> >> overview,
> >>>> the
> >>>>>>>> scheduler
> >>>>>>>>
> >>>>>>>> 1. moves tasks to SCHEDULED
> >>>>>>>> 2. runs a query to fetch SCHEDULED tasks from the db
> >>>>>>>> 3. examines the tasks
> >>>>>>>> 4. moves tasks to QUEUED
> >>>>>>>>
> >>>>>>>> I'm focusing on step 2 and afterwards. The current code doesn't
> >> take
> >>>> into
> >>>>>>>> account the max_active_tasks_per_dag. When it runs the query it
> >>>> fetches
> >>>>>>>> up to max_tis which is determined here
> >>>>>>>> <
> >>>>>>>
> >>
> https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py#L697-L705
> >>>>>>>> .
> >>>>>>>>
> >>>>>>>> For example,
> >>>>>>>>
> >>>>>>>> - if the query number is 32
> >>>>>>>> - all 32 tasks in line belong to the same dag, dag1
> >>>>>>>> - we are not concerned how the scheduler picks them
> >>>>>>>> - dag1 has max_active_tasks set to 5
> >>>>>>>>
> >>>>>>>> The current code will
> >>>>>>>>
> >>>>>>>> - get 32 tasks from dag1
> >>>>>>>> - start examining them one by one
> >>>>>>>> - once 5 are moved to QUEUED, it won't stop, it will keep
> >> examining
> >>>>>>>> the other 27 but won't be able to queue them because it has
> >> reached
> >>>>>>>> the
> >>>>>>>> limit
> >>>>>>>>
> >>>>>>>> In the next loop, although we have reached the maximum number of
> >>> tasks
> >>>>>>>> for
> >>>>>>>> dag1, the query will fetch again 32 tasks from dag1 to examine
> >> them
> >>>>>>>> and
> >>>>>>>> try to queue them.
> >>>>>>>>
> >>>>>>>> The issue is that it gets more tasks than it can queue from the db
> >>> and
> >>>>>>>> then
> >>>>>>>> examines them all.
> >>>>>>>>
> >>>>>>>> This all leads to unnecessary processing that builds up and the
> >> more
> >>>> load
> >>>>>>>> there is on the system, the more the throughput drops for the
> >>>> scheduler
> >>>>>>>> and
> >>>>>>>> the workers.
> >>>>>>>>
> >>>>>>>> What I'm proposing is to adjust the query in step 2, to check the
> >>>>>>>> max_active_tasks_per_dag
> >>>>>>>>
> >>>>>>>>> run a query to fetch SCHEDULED tasks from the db
> >>>>>>>> If a dag has already reached the maximum number of tasks in active
> >>>>>>>> states,
> >>>>>>>> it will be skipped by the query.
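> >>>>>>>>
> >>>>>>>> Roughly, the adjusted query could take a shape like the sketch
> >>>>>>>> below (illustrative only, not my actual POC code; it reuses
> >>>>>>>> Airflow's ORM models and shows only the max_active_tasks
> >>>>>>>> dimension):
> >>>>>>>>
> >>>>>>>> from sqlalchemy import func, select
> >>>>>>>> from sqlalchemy.orm import aliased
> >>>>>>>>
> >>>>>>>> from airflow.models.dag import DagModel
> >>>>>>>> from airflow.models.taskinstance import TaskInstance
> >>>>>>>> from airflow.utils.state import TaskInstanceState
> >>>>>>>>
> >>>>>>>> max_tis = 32  # stand-in for the computed max_tis / max_tis_per_query value
> >>>>>>>>
> >>>>>>>> # Correlated count of TIs already active in the candidate TI's dag.
> >>>>>>>> ActiveTI = aliased(TaskInstance)
> >>>>>>>> active_in_dag = (
> >>>>>>>>     select(func.count())
> >>>>>>>>     .where(
> >>>>>>>>         ActiveTI.dag_id == DagModel.dag_id,
> >>>>>>>>         ActiveTI.state.in_([TaskInstanceState.RUNNING, TaskInstanceState.QUEUED]),
> >>>>>>>>     )
> >>>>>>>>     .correlate(DagModel)
> >>>>>>>>     .scalar_subquery()
> >>>>>>>> )
> >>>>>>>>
> >>>>>>>> # Fetch SCHEDULED TIs, skipping dags already at their max_active_tasks cap.
> >>>>>>>> query = (
> >>>>>>>>     select(TaskInstance)
> >>>>>>>>     .join(DagModel, DagModel.dag_id == TaskInstance.dag_id)
> >>>>>>>>     .where(TaskInstance.state == TaskInstanceState.SCHEDULED)
> >>>>>>>>     .where(active_in_dag < DagModel.max_active_tasks)
> >>>>>>>>     .limit(max_tis)
> >>>>>>>> )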
> >>>>>>>>
> >>>>>>>> Don't we already stop examining at that point? I guess there's two
> >>>>>>>> things
> >>>>>>>>
> >>>>>>>>> you might be referring to. One is, which TIs come out of the db
> >> and
> >>>>>>>>> into
> >>>>>>>>> python, and the other is, what we do in python. Just might be
> >>> helpful
> >>>>>>>>> to
> >>>>>>>>> be clear about the specific enhancements & changes you are
> >> making.
> >>>>>>>> I think that if we adjust the query and fetch the right number of
> >>>> tasks,
> >>>>>>>> then we won't have to make changes to what is done in python.
> >>>>>>>>
> >>>>>>>> On Mon, Aug 4, 2025 at 8:01 AM Daniel Standish
> >>>>>>>> daniel.stand...@astronomer.io.invalid wrote:
> >>>>>>>>
> >>>>>>>>> @Christos Bisias
> >>>>>>>>>
> >>>>>>>>> If you have a very large dag, and its tasks have been scheduled,
> >>> then
> >>>>>>>>> the
> >>>>>>>>>
> >>>>>>>>>> scheduler will keep examining the tasks for queueing, even if it
> >>> has
> >>>>>>>>>> reached the maximum number of active tasks for that particular
> >>> dag.
> >>>>>>>>>> Once
> >>>>>>>>>> that fails, then it will move on to examine the scheduled tasks
> >> of
> >>>>>>>>>> the
> >>>>>>>>>> next
> >>>>>>>>>> dag or dag_run in line.
> >>>>>>>>> Can you make this a little more precise? There's some protection
> >>>>>>>>> against
> >>>>>>>>> "starvation" i.e. dag runs recently considered should go to the
> >>> back
> >>>> of
> >>>>>>>>> the
> >>>>>>>>> line next time.
> >>>>>>>>>
> >>>>>>>>> Maybe you could clarify why / how that's not working / not
> >> optimal
> >>> /
> >>>>>>>>> how
> >>>>>>>>> to
> >>>>>>>>> improve.
> >>>>>>>>>
> >>>>>>>>> If there are available slots in the pool and
> >>>>>>>>>
> >>>>>>>>>> the max parallelism hasn't been reached yet, then the scheduler
> >>>>>>>>>> should
> >>>>>>>>>> stop
> >>>>>>>>>> processing a dag that has already reached its max capacity of
> >>> active
> >>>>>>>>>> tasks.
> >>>>>>>>> If a dag run (or dag) is already at max capacity, it doesn't
> >> really
> >>>>>>>>> matter
> >>>>>>>>> if there are slots available or parallelism isn't reached --
> >>>> shouldn't
> >>>>>>>>> it
> >>>>>>>>> stop anyway?
> >>>>>>>>>
> >>>>>>>>> In addition, the number of scheduled tasks picked for examining,
> >>>> should
> >>>>>>>>> be
> >>>>>>>>>
> >>>>>>>>>> capped at the number of max active tasks if that's lower than
> >> the
> >>>>>>>>>> query
> >>>>>>>>>> limit. If the active limit is 10 and we already have 5 running,
> >>> then
> >>>>>>>>>> we
> >>>>>>>>>> can
> >>>>>>>>>> queue at most 5 tasks. In that case, we shouldn't examine more
> >>> than
> >>>>>>>>>> that.
> >>>>>>>>> Don't we already stop examining at that point? I guess there's
> >> two
> >>>>>>>>> things
> >>>>>>>>> you might be referring to. One is, which TIs come out of the db
> >> and
> >>>>>>>>> into
> >>>>>>>>> python, and the other is, what we do in python. Just might be
> >>> helpful
> >>>>>>>>> to
> >>>>>>>>> be clear about the specific enhancements & changes you are
> >> making.
> >>>>>>>>> There is already a patch with the changes mentioned above. IMO,
> >>> these
> >>>>>>>>>> changes should be enabled/disabled with a config flag and not by
> >>>>>>>>>> default
> >>>>>>>>>> because not everyone has the same needs as us. In our testing,
> >>>>>>>>>> adding a
> >>>>>>>>>> limit on the tasks retrieved from the db requires more
> >> processing
> >>> on
> >>>>>>>>>> the
> >>>>>>>>>> query which actually makes things worse when you have multiple
> >>> small
> >>>>>>>>>> dags.
> >>>>>>>>> I would like to see a stronger case made for configurability. Why
> >>>>>>>>> make it configurable? If the performance is always better, it
> >>>>>>>>> should not be made configurable. Unless it's merely released as an
> >>>>>>>>> opt-in experimental feature. If it is worse in some profiles,
> >>>>>>>>> let's be clear about that.
> >>>>>>>>> I did not read anything after `Here is a simple test case that
> >>>>>>>>> makes the benefits of the improvements noticeable` because it
> >>>>>>>>> seemed rather long-winded detail about a test case. A higher level
> >>>>>>>>> summary might be helpful to your audience. Is there a PR with your
> >>>>>>>>> optimization? You wrote "there is a patch" but did not, unless I
> >>>>>>>>> missed something, share it. I would take a look if you share it
> >>>>>>>>> though.
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> On Sun, Aug 3, 2025 at 5:08 PM Daniel Standish <
> >>>>>>>>> daniel.stand...@astronomer.io> wrote:
> >>>>>>>>>
> >>>>>>>>>> Yes, the UI is another part of this.
> >>>>>>>>>>
> >>>>>>>>>> At some point the grid and graph views completely stop making
> >>> sense
> >>>>>>>>>> for
> >>>>>>>>>> that volume, and another type of view would be required both for
> >>>>>>>>>> usability
> >>>>>>>>>> and performance
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Aug 3, 2025 at 11:04 AM Jens Scheffler
> >>>>>>>>>> j_scheff...@gmx.de.invalid
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We also have a current demand for a workflow that executes 10k
> >>>>>>>>>>> to 100k tasks. Together with @AutomationDev85 we are working on
> >>>>>>>>>>> a local solution because we also saw problems in the Scheduler
> >>>>>>>>>>> that do not scale linearly. And they are for sure not easy to
> >>>>>>>>>>> fix. But from our investigation there are also other problems to
> >>>>>>>>>>> be considered - the UI, for example, will also potentially have
> >>>>>>>>>>> problems.
> >>>>>>>>>>>
> >>>>>>>>>>> I am a bit sceptical that PR 49160 completely fixes the problems
> >>>>>>>>>>> mentioned here, and I made some comments. I do not want to stop
> >>>>>>>>>>> the enthusiasm to fix and improve things, but the Scheduler is
> >>>>>>>>>>> quite complex and changes need to be made with care.
> >>>>>>>>>>>
> >>>>>>>>>>> Actually I like the patch
> >>>>>>>
> >>
> https://github.com/xBis7/airflow/compare/69ab304ffa3d9b847b7dd0ee90ee6ef100223d66..scheduler-perf-patch
> >>>>>>>>>>> as it just adds some limit preventing the scheduler from
> >>>>>>>>>>> focusing on only one run. But the complexity is a bit big for a
> >>>>>>>>>>> "patch" :-D
> >>>>>>>>>>>
> >>>>>>>>>>> For the moment, I'd also propose the way that Jarek described
> >>>>>>>>>>> and split up the Dag into multiple parts (divide and conquer).
> >>>>>>>>>>>
> >>>>>>>>>>> Otherwise, if there is a concrete demand for such large Dags...
> >>>>>>>>>>> we maybe rather need a broader initiative if we want to ensure
> >>>>>>>>>>> 10k, 100k, 1M? tasks are supported per Dag. Because depending on
> >>>>>>>>>>> the magnitude we strive for, different approaches are needed.
> >>>>>>>>>>>
> >>>>>>>>>>> Jens
> >>>>>>>>>>>
> >>>>>>>>>>> On 03.08.25 16:33, Daniel Standish wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Definitely an area of the scheduler with some opportunity for
> >>>>>>>>>>>> performance
> >>>>>>>>>>>> improvement.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would just mention that you should also attempt to include
> >>>>>>>>>>>> some performance testing at load / scale because window
> >>>>>>>>>>>> functions are going to be more expensive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What happens when you have many dags, many historical dag runs
> >>>>>>>>>>>> & TIs, lots of stuff running concurrently? You need to be
> >>>>>>>>>>>> mindful of the
> >>>>>>>>>>>> overall
> >>>>>>>>>>>> impact of such a change, and not look only at the time spent
> >> on
> >>>>>>>>>>>> scheduling
> >>>>>>>>>>>> this particular dag.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I did not look at the PRs yet, maybe you've covered this, but
> >>>>>>>>>>>> it's important.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sun, Aug 3, 2025 at 5:57 AM Christos Bisias<
> >>>>>>>>>>>> christos...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'm going to review the PR code and test it more thoroughly
> >>>>>>>>>>>>> before
> >>>>>>>>>>>>> leaving
> >>>>>>>>>>>>> a comment.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is my code for reference
> >>>>>>>
> >>
> https://github.com/xBis7/airflow/compare/69ab304ffa3d9b847b7dd0ee90ee6ef100223d66..scheduler-perf-patch
> >>>>>>>>>>>>> The current version is setting a limit per dag, across all
> >>>>>>>>>>>>> dag_runs.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Please correct me if I'm wrong, but the PR looks like it's
> >>>>>>>>>>>>> changing
> >>>>>>>>>>>>> the way
> >>>>>>>>>>>>> that tasks are prioritized to avoid starvation. If that's the
> >>>>>>>>>>>>> case,
> >>>>>>>>>>>>> I'm not
> >>>>>>>>>>>>> sure that this is the same issue. My proposal is that, if we
> >>> have
> >>>>>>>>>>>>> reached
> >>>>>>>>>>>>> the max resources assigned to a dag, then stop processing its
> >>>>>>>>>>>>> tasks
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>> move on to the next one. I'm not changing how or which tasks
> >>> are
> >>>>>>>>>>>>> picked.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sun, Aug 3, 2025 at 3:23 PM asquator<asqua...@proton.me
> >>>>>>>>>>>>> .invalid>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you for the feedback.
> >>>>>>>>>>>>>> Please, describe the case with failing limit checks in the
> >> PR
> >>>>>>>>>>>>>> (DAG's
> >>>>>>>>>>>>>> parameters and its tasks' parameters and what fails to be
> >>>>>>>>>>>>>> checked)
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>> we'll try to fix it ASAP before you can test it again. Let's
> >>>>>>>>>>>>>> continue
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> PR-related discussion in the PR itself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sunday, August 3rd, 2025 at 2:21 PM, Christos Bisias <
> >>>>>>>>>>>>>> christos...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you for bringing this PR to my attention.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I haven't studied the code but I ran a quick test on the
> >>> branch
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>> completely ignores the limit on scheduled tasks per dag or
> >>>>>>>>>>>>>>> dag_run.
> >>>>>>>>>>>>>>> It
> >>>>>>>>>>>>>>> grabbed 70 tasks from the first dag and then moved all 70
> >> to
> >>>>>>>>>>>>>>> QUEUED
> >>>>>>>>>>>>>>> without
> >>>>>>>>>>>>>>> any further checks.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is how I tested it
> >>>>>>>
> >>
> https://github.com/Asquator/airflow/compare/feature/pessimistic-task-fetching-with-window-function...xBis7:airflow:scheduler-window-function-testing?expand=1
> >>>>>>>>>>>>>>> On Sun, Aug 3, 2025 at 1:44 PM asquator <asqua...@proton.me.invalid>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is a known issue stemming from the optimistic
> >>> scheduling
> >>>>>>>>>>>>>>>> strategy
> >>>>>>>>>>>>>>>> used in Airflow. We do address this in the above-mentioned
> >>>>>>>>>>>>>>>> PR. I
> >>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> note that there are many cases where this problem may
> >>>>>>>>>>>>>>>> appear—it
> >>>>>>>>>>>>>>>> was
> >>>>>>>>>>>>>>>> originally detected with pools, but we are striving to fix
> >>> it
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>> cases,
> >>>>>>>>>>>>>>>> such as the one described here with
> >> max_active_tis_per_dag,
> >>> by
> >>>>>>>>>>>>>>>> switching to
> >>>>>>>>>>>>>>>> pessimistic scheduling with SQL window functions. While
> >> the
> >>>>>>>>>>>>>>>> current
> >>>>>>>>>>>>>>>> strategy simply pulls the max_tis tasks and drops the ones
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>> meet
> >>>>>>>>>>>>>>>> the constraints, the new strategy will pull only the tasks
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>> actually ready to be scheduled and that comply with all
> >>>>>>>>>>>>>>>> concurrency
> >>>>>>>>>>>>>>>> limits.
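> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In spirit the new strategy looks like the simplified sketch
> >>>>>>>>>>>>>>>> below (not the PR's actual query; it shows only the
> >>>>>>>>>>>>>>>> max_active_tasks dimension and ignores pools, per-task
> >>>>>>>>>>>>>>>> limits and slots already in use):
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> from sqlalchemy import func, select
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> from airflow.models.dag import DagModel
> >>>>>>>>>>>>>>>> from airflow.models.taskinstance import TaskInstance
> >>>>>>>>>>>>>>>> from airflow.utils.state import TaskInstanceState
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> max_tis = 32  # stand-in for [scheduler] max_tis_per_query
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> # Rank SCHEDULED TIs within each dag so that rows exceeding the
> >>>>>>>>>>>>>>>> # dag's cap are never pulled out of the database at all.
> >>>>>>>>>>>>>>>> ranked = (
> >>>>>>>>>>>>>>>>     select(
> >>>>>>>>>>>>>>>>         TaskInstance.dag_id,
> >>>>>>>>>>>>>>>>         TaskInstance.task_id,
> >>>>>>>>>>>>>>>>         TaskInstance.run_id,
> >>>>>>>>>>>>>>>>         TaskInstance.map_index,
> >>>>>>>>>>>>>>>>         func.row_number()
> >>>>>>>>>>>>>>>>         .over(
> >>>>>>>>>>>>>>>>             partition_by=TaskInstance.dag_id,
> >>>>>>>>>>>>>>>>             order_by=TaskInstance.priority_weight.desc(),
> >>>>>>>>>>>>>>>>         )
> >>>>>>>>>>>>>>>>         .label("rank_in_dag"),
> >>>>>>>>>>>>>>>>         DagModel.max_active_tasks.label("dag_cap"),
> >>>>>>>>>>>>>>>>     )
> >>>>>>>>>>>>>>>>     .join(DagModel, DagModel.dag_id == TaskInstance.dag_id)
> >>>>>>>>>>>>>>>>     .where(TaskInstance.state == TaskInstanceState.SCHEDULED)
> >>>>>>>>>>>>>>>>     .subquery()
> >>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> # A full version would also subtract the TIs already running per dag.
> >>>>>>>>>>>>>>>> eligible = (
> >>>>>>>>>>>>>>>>     select(ranked)
> >>>>>>>>>>>>>>>>     .where(ranked.c.rank_in_dag <= ranked.c.dag_cap)
> >>>>>>>>>>>>>>>>     .limit(max_tis)
> >>>>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>>>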
> >>>>>>>>>>>>>>>> It would be very helpful for pushing this change to
> >>> production
> >>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>> could assist us in alpha-testing it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> See also:
> >>>>>>>>>>>>>>>> https://github.com/apache/airflow/discussions/49160
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Sent with Proton Mail secure email.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Sunday, August 3rd, 2025 at 12:59 PM, Elad Kalif
> >>>>>>>>>>>>>>>> elad...@apache.org
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think most of your issues will be addressed by
> >>>>>>>>>>>>>>>>> https://github.com/apache/airflow/pull/53492
> >>>>>>>>>>>>>>>>> The PR code can be tested with Breeze, so you can set it
> >>>>>>>>>>>>>>>>> up and see if it solves the problem; this will also help
> >>>>>>>>>>>>>>>>> with confirming it's the right fix.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Sun, Aug 3, 2025 at 10:46 AM Christos Bisias
> >>>>>>>>>>>>>>>>> christos...@gmail.com
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The scheduler is very efficient when running a large
> >>>>>>>>>>>>>>>>>> number of dags with up to 1000 tasks each. But in our
> >>>>>>>>>>>>>>>>>> case, we have dags with as many as 10,000 tasks. And in
> >>>>>>>>>>>>>>>>>> that scenario the scheduler and worker throughput drops
> >>>>>>>>>>>>>>>>>> significantly. Even if you have 1 such large dag with
> >>>>>>>>>>>>>>>>>> scheduled tasks, the performance hit becomes noticeable.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> We did some digging and we found that the issue comes
> >> from
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> scheduler's
> >>>>>>>>>>>>>>>>>> _executable_task_instances_to_queued
> >>>>>>>>>>>>>>>>>> <
> >>>>>>>
> >>
> https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py#L293C9-L647
> >>>>>>>>>>>>>>>>>> method.
> >>>>>>>>>>>>>>>>>> In particular with the db query here
> >>>>>>>>>>>>>>>>>> <
> >>>>>>>
> >>
> https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py#L364-L375
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> examining the results here
> >>>>>>>>>>>>>>>>>> <
> >>>>>>>
> >>
> https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py#L425
> >>>>>>>>>>>>>>>>>> .
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If you have a very large dag, and its tasks have been
> >>>>>>>>>>>>>>>>>> scheduled,
> >>>>>>>>>>>>>>>>>> then
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> scheduler will keep examining the tasks for queueing,
> >> even
> >>>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>> reached the maximum number of active tasks for that
> >>>>>>>>>>>>>>>>>> particular
> >>>>>>>>>>>>>>>>>> dag.
> >>>>>>>>>>>>>>>>>> Once
> >>>>>>>>>>>>>>>>>> that fails, then it will move on to examine the
> >> scheduled
> >>>>>>>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> next
> >>>>>>>>>>>>>>>>>> dag or dag_run in line.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is inefficient and causes the throughput of the
> >>>>>>>>>>>>>>>>>> scheduler
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> workers to drop significantly. If there are available
> >>> slots
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> pool and
> >>>>>>>>>>>>>>>>>> the max parallelism hasn't been reached yet, then the
> >>>>>>>>>>>>>>>>>> scheduler
> >>>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>>> stop
> >>>>>>>>>>>>>>>>>> processing a dag that has already reached its max
> >> capacity
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> active
> >>>>>>>>>>>>>>>>>> tasks.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In addition, the number of scheduled tasks picked for
> >>>>>>>>>>>>>>>>>> examining,
> >>>>>>>>>>>>>>>>>> should be
> >>>>>>>>>>>>>>>>>> capped at the number of max active tasks if that's lower
> >>>>>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> query
> >>>>>>>>>>>>>>>>>> limit. If the active limit is 10 and we already have 5
> >>>>>>>>>>>>>>>>>> running,
> >>>>>>>>>>>>>>>>>> then
> >>>>>>>>>>>>>>>>>> we can
> >>>>>>>>>>>>>>>>>> queue at most 5 tasks. In that case, we shouldn't
> >> examine
> >>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>>>>> that.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> There is already a patch with the changes mentioned
> >> above.
> >>>>>>>>>>>>>>>>>> IMO,
> >>>>>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>> changes should be enabled/disabled with a config flag
> >> and
> >>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> default
> >>>>>>>>>>>>>>>>>> because not everyone has the same needs as us. In our
> >>>>>>>>>>>>>>>>>> testing,
> >>>>>>>>>>>>>>>>>> adding a
> >>>>>>>>>>>>>>>>>> limit on the tasks retrieved from the db requires more
> >>>>>>>>>>>>>>>>>> processing
> >>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> query which actually makes things worse when you have
> >>>>>>>>>>>>>>>>>> multiple
> >>>>>>>>>>>>>>>>>> small
> >>>>>>>>>>>>>>>>>> dags.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Here is a simple test case that makes the benefits of
> >> the
> >>>>>>>>>>>>>>>>>> improvements
> >>>>>>>>>>>>>>>>>> noticeable
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - we have 3 dags with thousands of tasks each
> >>>>>>>>>>>>>>>>>> - for simplicity let's have 1 dag_run per dag
> >>>>>>>>>>>>>>>>>> - triggering them takes some time and due to that, the
> >>> FIFO
> >>>>>>>>>>>>>>>>>> order
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> tasks is very clear
> >>>>>>>>>>>>>>>>>> - e.g. 1000 tasks from dag1 were scheduled first and
> >> then
> >>>>>>>>>>>>>>>>>> 200
> >>>>>>>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>>>>>> from dag2 etc.
> >>>>>>>>>>>>>>>>>> - the executor has parallelism=100 and
> >> slots_available=100
> >>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>> means
> >>>>>>>>>>>>>>>>>> that it can run up to 100 tasks concurrently
> >>>>>>>>>>>>>>>>>> - max_active_tasks_per_dag is 4 which means that we can
> >>> have
> >>>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> 4
> >>>>>>>>>>>>>>>>>> tasks running per dag.
> >>>>>>>>>>>>>>>>>> - For 3 dags, it means that we can run up to 12 tasks at
> >>> the
> >>>>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>>> time (4 tasks from each dag)
> >>>>>>>>>>>>>>>>>> - max tis per query are set to 32, meaning that we can
> >>>>>>>>>>>>>>>>>> examine
> >>>>>>>>>>>>>>>>>> up
> >>>>>>>>>>>>>>>>>> to 32
> >>>>>>>>>>>>>>>>>> scheduled tasks if there are available pool slots
> >>>>>>>>>>>>>>>>>>
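> >>>>>>>>>>>>>>>>>> For reference, these knobs map to the following Airflow
> >>>>>>>>>>>>>>>>>> settings; one illustrative way to set them is via
> >>>>>>>>>>>>>>>>>> environment variables:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> import os
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> # Illustrative mapping of the scenario above to Airflow config.
> >>>>>>>>>>>>>>>>>> os.environ["AIRFLOW__CORE__PARALLELISM"] = "100"
> >>>>>>>>>>>>>>>>>> os.environ["AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"] = "4"
> >>>>>>>>>>>>>>>>>> os.environ["AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY"] = "32"
> >>>>>>>>>>>>>>>>>>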
> >>>>>>>>>>>>>>>>>> If we were to run the scheduler loop repeatedly until it
> >>>>>>>>>>>>>>>>>> queues
> >>>>>>>>>>>>>>>>>> 12
> >>>>>>>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>>>>>> and test the part that examines the scheduled tasks and
> >>>>>>>>>>>>>>>>>> queues
> >>>>>>>>>>>>>>>>>> them,
> >>>>>>>>>>>>>>>>>> then
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - with the query limit
> >>>>>>>>>>>>>>>>>> - 1 iteration, total time 0.05
> >>>>>>>>>>>>>>>>>> - During the iteration
> >>>>>>>>>>>>>>>>>> - we have parallelism 100, available slots 100 and query
> >>>>>>>>>>>>>>>>>> limit
> >>>>>>>>>>>>>>>>>> 32
> >>>>>>>>>>>>>>>>>> which means that it will examine up to 32 scheduled
> >> tasks
> >>>>>>>>>>>>>>>>>> - it can queue up to 100 tasks
> >>>>>>>>>>>>>>>>>> - examines 12 tasks (instead of 32)
> >>>>>>>>>>>>>>>>>> - 4 tasks from dag1, reached max for the dag
> >>>>>>>>>>>>>>>>>> - 4 tasks from dag2, reached max for the dag
> >>>>>>>>>>>>>>>>>> - and 4 tasks from dag3, reached max for the dag
> >>>>>>>>>>>>>>>>>> - queues 4 from dag1, reaches max for the dag and moves
> >> on
> >>>>>>>>>>>>>>>>>> - queues 4 from dag2, reaches max for the dag and moves
> >> on
> >>>>>>>>>>>>>>>>>> - queues 4 from dag3, reaches max for the dag and moves
> >> on
> >>>>>>>>>>>>>>>>>> - stops queueing because we have reached the maximum per
> >>>>>>>>>>>>>>>>>> dag,
> >>>>>>>>>>>>>>>>>> although there are slots for more tasks
> >>>>>>>>>>>>>>>>>> - iteration finishes
> >>>>>>>>>>>>>>>>>> - without
> >>>>>>>>>>>>>>>>>> - 3 iterations, total time 0.29
> >>>>>>>>>>>>>>>>>> - During iteration 1
> >>>>>>>>>>>>>>>>>> - Examines 32 tasks, all from dag1 (due to FIFO)
> >>>>>>>>>>>>>>>>>> - queues 4 from dag1 and tries to queue the other 28 but
> >>>>>>>>>>>>>>>>>> fails
> >>>>>>>>>>>>>>>>>> - During iteration 2
> >>>>>>>>>>>>>>>>>> - examines the next 32 tasks from dag1
> >>>>>>>>>>>>>>>>>> - it can't queue any of them because it has reached the
> >>> max
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> dag1, since the previous 4 are still running
> >>>>>>>>>>>>>>>>>> - examines 32 tasks from dag2
> >>>>>>>>>>>>>>>>>> - queues 4 from dag2 and tries to queue the other 28 but
> >>>>>>>>>>>>>>>>>> fails
> >>>>>>>>>>>>>>>>>> - During iteration 3
> >>>>>>>>>>>>>>>>>> - examines the next 32 tasks from dag1, same tasks that
> >>> were
> >>>>>>>>>>>>>>>>>> examined in iteration 2
> >>>>>>>>>>>>>>>>>> - it can't queue any of them because it has reached the
> >>> max
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> dag1 and the first 4 are still running
> >>>>>>>>>>>>>>>>>> - examines 32 tasks from dag2 , can't queue any of them
> >>>>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>>> it has reached max for dag2 as well
> >>>>>>>>>>>>>>>>>> - examines 32 tasks from dag3
> >>>>>>>>>>>>>>>>>> - queues 4 from dag3 and tries to queue the other 28 but
> >>>>>>>>>>>>>>>>>> fails
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I used very low values for all the configs so that I can
> >>>>>>>>>>>>>>>>>> make
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> point
> >>>>>>>>>>>>>>>>>> clear and easy to understand. If we increase them, then
> >>> this
> >>>>>>>>>>>>>>>>>> patch
> >>>>>>>>>>>>>>>>>> also
> >>>>>>>>>>>>>>>>>> makes the task selection more fair and the resource
> >>>>>>>>>>>>>>>>>> distribution
> >>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>>>>>>>>> even.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I would appreciate it if anyone familiar with the
> >>>>>>>>>>>>>>>>>> scheduler's
> >>>>>>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>> confirm this and also provide any feedback.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Additionally, I have one question regarding the query
> >>> limit.
> >>>>>>>>>>>>>>>>>> Should it
> >>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> per dag_run or per dag? I've noticed that
> >>>>>>>>>>>>>>>>>> max_active_tasks_per_dag
> >>>>>>>>>>>>>>>>>> has
> >>>>>>>>>>>>>>>>>> been changed to provide a value per dag_run but the docs
> >>>>>>>>>>>>>>>>>> haven't
> >>>>>>>>>>>>>>>>>> been
> >>>>>>>>>>>>>>>>>> updated.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thank you!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>> Christos Bisias
> >>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>>> To unsubscribe, e-mail:dev-unsubscr...@airflow.apache.org
> >>>>>>>>>>>>>>>> For additional commands,
> >> e-mail:dev-h...@airflow.apache.org
> >>>>>>>>>
> >>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>> To unsubscribe, e-mail:dev-unsubscr...@airflow.apache.org
> >>>>>>>>>>>>>> For additional commands, e-mail:dev-h...@airflow.apache.org
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> >>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> >>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> >>>>
> >>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>
