Hi Jarek, et al.
I assume there are many cases. Some people might want to limit the
number of workers, and in those cases deferred tasks likely should not
count in; in other cases the limits are defined to protect backends
(as with pools). That is also why we (urgently) need this and how the
bug came up.
I think adding more parameters and options and renaming them all
properly is a mid- to long-term exercise, especially as it involves
Dag migration for all users. So this is a change that will maybe only
get a final cleanup in Airflow 4. We are not green-field, so we need
backward compatibility anyway. Aligning on the naming (which is always
the hardest part...) and making this change will take longer.
Therefore I'd propose (1) a pragmatic fix that can be made NOW as a
bugfix: a global config switch, similar to the one for pools, that
tells Airflow at a global level whether to count deferred tasks in or
out. And as a follow-up, (2) a rework of the limitation parameters,
which - quite frankly - are a bit fragmented across various areas.
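To illustrate what such a global switch could look like, here is a minimal sketch. The flag name `count_deferred_as_active` and the helper functions are purely illustrative assumptions, not actual Airflow options; the idea mirrors the per-pool `include_deferred` flag from PR #32709.

```python
# Hypothetical sketch of a global config switch deciding whether the
# Deferred state counts toward "active" limits. The flag name and
# helpers are illustrative only, not real Airflow configuration.

RUNNING_STATES = {"queued", "running"}


def counted_states(count_deferred_as_active: bool) -> set:
    """States counted against max_active_tasks and similar limits."""
    states = set(RUNNING_STATES)
    if count_deferred_as_active:
        states.add("deferred")
    return states


def occupied_slots(ti_states, count_deferred_as_active: bool) -> int:
    """How many task instances count against a Dag's concurrency limit."""
    counted = counted_states(count_deferred_as_active)
    return sum(1 for s in ti_states if s in counted)
```

With the switch on, a deferred task occupies a slot; with it off, the Airflow 3 behavior described in this thread is preserved.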
Would this be OK for most?
Jens
P.S.: I need to highlight here that after our recent migration to
Airflow 3 in production, we have had serious problems over the last
days and very bad feedback from our users. It took us more than a day
to understand and drill into the root cause - multiple person-days
wasted - until we traced it to the discussion
https://lists.apache.org/thread/nn4y1z0yrydkmw9np4f0z5lm9gh8tmfl with
the lazy consensus
https://lists.apache.org/thread/9o84d3yn934m32gtlpokpwtbbmtxj47l and the
PR https://github.com/apache/airflow/pull/42953 causing this trouble. Why?
Because in Airflow 2 we used max_active_tasks, which defaulted to 16 as
a safety net, and everybody could write their Dag. If somebody wanted
to scale larger, a PR increasing max_active_tasks above 16 triggered a
review, and we could see which Dag effectively took how many resources.
With the (badly documented!) semantic change in Airflow 3, a lot of
workload now runs unrestricted because it relies on mapped tasks: the
alternative max_active_tis_per_dag does not default to 16 like the
previous parameter and is only counted per task. So if you have a Dag
with multiple mapped tasks, and pools are needed for other limits, this
is now a problem.
I do not understand why the discussion here is so much more precise...
when the mentioned change to max_active_tasks did not respect any of
this and was simply a semantically breaking change... yes, my bad that I
missed the discussion, but this is really, really a problem for us now :-(
On 09.03.26 10:18, Jarek Potiuk wrote:
+1 on what TP and Karthikeyan wrote. We need a solid proposal for naming
and explicitly defining those terms, along with a way for users to keep the
old counting method (settable per Dag). And I think it would be ok to
change default behaviour as long as we are very clear in documenting it,
explaining that this is really a "bug fix" (in the sense that this
behaviour was really not intentional and by changing it we express our
intentions explicitly) and allow the users to go back easily in Dags that
rely on it - so that they can maybe rework them in the future and remove
it.
On Mon, Mar 9, 2026 at 8:35 AM Karthikeyan <[email protected]> wrote:
+1 on having a field to restore backwards-compatible behaviour at the
Dag level if the Dag parameter is being changed. Most of our workloads
involve submitting jobs to Spark and other upstream systems, and each
user has a corresponding pool. With deferred tasks not counted as
active, users had issues where more submissions were made than the
upstream could handle, so for those users the pool was updated. There
are other workloads, like HTTP-based defers, where users just poll and
don't need to worry about upstream capacity. I guess deferred was
initially documented such that the pool slot is released and more
concurrent workloads can run. Whether a task counts as active depends
on the workload and use case. It would be helpful to have this
behaviour optional and opt-in to avoid confusion.
Thanks
Regards,
Karthikeyan S
On Mon, Feb 23, 2026, 9:43 PM Vikram Koka via dev <[email protected]>
wrote:
I definitely agree with the intent, but I am concerned about the actual
implications of making this change from a user-experience perspective.
With respect to pools, I would like an updated perspective on how useful
and used this is today. For example, I suspect that the async Python
operator change coming in the new AIP as part of 3.2 does not respect the
pools configuration either.
The max active task configurations are very useful while using the Celery
executor, which is the majority today. I got a bunch of questions around
this as part of the backfill enhancements in 3.0.
I hesitate to make changes to these configuration options without a
clear understanding and articulation of the tradeoffs.
Just my two cents,
Vikram
On Mon, Feb 23, 2026 at 2:34 AM Wei Lee <[email protected]> wrote:
I like what Jarek suggested, but we should avoid using the term
"Running".
From Airflow's perspective, a Deferred task is not considered a Running
task, even though it may be viewed differently in the user's context.
Additionally, we are currently using the term "Executing" here
https://github.com/apache/airflow/blob/e0cd6e246c288d33f359ec2268b3d342832e9648/airflow-core/src/airflow/utils/state.py#L67
Maybe we can count Deferred and Running tasks as "Executing"? The thing
that kinda bugs me is that "Deferred" is also an IntermediateTIState
here.
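Wei Lee's grouping could look roughly like the sketch below. The enum is a simplified stand-in for illustration, not Airflow's actual TaskInstanceState, and the `EXECUTING_STATES` set is an assumption about how the grouping might be defined.

```python
# Sketch of the "Executing" grouping idea: treat Executing as the union
# of Running and Deferred for limit counting. Simplified stand-in
# types, not Airflow's real state definitions.
from enum import Enum


class TIState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DEFERRED = "deferred"
    SUCCESS = "success"


# Running and Deferred together form the hypothetical "executing" group.
EXECUTING_STATES = frozenset({TIState.RUNNING, TIState.DEFERRED})


def is_executing(state: TIState) -> bool:
    """True if the task instance should count as executing."""
    return state in EXECUTING_STATES
```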
On 2026/02/22 20:22:45 Natanel wrote:
Hello Jens, I agree with everything you said. For some reason the
"Deferred" state is not counted towards an active task, where
intuitively it should be part of the group.
As I see it, all the configurations talk about *active* tasks (such as
max_active_tasks, max_active_tis_per_dag, max_active_tis_per_dagrun),
which I think is quite a confusing term. To solve this, a clear
definition of what an "active" task is should be given.
It is possible to define that an "active" task is any task which is
either running, queued, OR deferred, but this would require a new
configuration for backwards compatibility, such as
"count_deferred_as_active" (a more enforcing, global approach, which we
might not want), while not introducing too much additional complexity.
Adding more parameters by which we schedule tasks will only make
scheduling decisions harder, as more parameters need to be checked,
which will most likely slow down each decision and might slow down the
scheduler.
I liked Jarek's approach; however, I think that maybe instead of
introducing a few new params, we could rename the current parameters
while keeping behavior as is, slowly deprecating the "active"
configurations, as Jarek said, and for some time keep both the "active"
and the "running" param, with "active" being an alias for "running"
until "active" is removed.
If a param for deferred tasks is needed, it is possible to add one only
for deferrable tasks, so as not to impact current scheduling decisions
made by the scheduler.
I see both approaches as viable, yet I think that adding an additional
param might introduce more complexity and maybe should be split out of
the regular task flow, as a deferrable task is not the same as a
running task. I tend to lean towards the first approach, as it seems to
be the simplest; however, the second approach might be more beneficial
long-term.
Best Regards,
Natanel
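The aliasing idea (keep "active" as a deprecated alias for "running" during a transition period) could be sketched like this. Both parameter names here are hypothetical, chosen only to illustrate the deprecation mechanics.

```python
# Hypothetical sketch of the alias/deprecation approach: accept the old
# max_active_tasks name as an alias for a renamed max_running_tasks,
# warning on use. Names are illustrative, not real Airflow parameters.
import warnings


def resolve_running_limit(max_running_tasks=None, max_active_tasks=None,
                          default=16):
    """Pick the effective limit, preferring the new parameter name."""
    if max_running_tasks is not None:
        return max_running_tasks
    if max_active_tasks is not None:
        warnings.warn(
            "max_active_tasks is deprecated; use max_running_tasks",
            DeprecationWarning,
            stacklevel=2,
        )
        return max_active_tasks
    return default
```

Existing Dags keep working unchanged (with a warning), while new Dags use the new name; the alias can be dropped once the deprecation window closes.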
On Sun, 22 Feb 2026 at 18:43, Jarek Potiuk <[email protected]> wrote:
+1. But I think that there are cases where people wanted to
**actually** use `max_*` to limit how many workers the DAG or DAG run
will take. So possibly we should give them such an option—for example,
max_running_tis_per_dag, etc.
There is also the question of backward compatibility. I can see the
possibility of side effects if that changes "suddenly" after an
upgrade. For example, it might mean that some Dags suddenly start using
far fewer workers than before and become starved.
So - if we want to change it, I think we should deprecate "_active" and
possibly add two new sets of parameters with different names—but
naming in this case is hard (more than usual).
J.
On Sun, Feb 22, 2026 at 5:25 PM Pavankumar Gopidesu
<[email protected]> wrote:
Hi Jens,
Thanks for starting this discussion. I agree that we should update how
these tasks are counted.
I previously started a PR[1] to include deferred tasks in
max_active_tasks, but I was sidetracked by other priorities. As you
noted, this change needs to encompass not only max_active_tasks but
also the other parameters you described.
[1]: https://github.com/apache/airflow/pull/41560
Regards,
Pavan
On Sun, Feb 22, 2026 at 12:43 PM constance.astronomer.io via
dev <
[email protected]> wrote:
Agreed. In my opinion, the only time we should not be counting deferred
tasks is for configurations that control worker slots, like the number
of tasks that run concurrently on a Celery worker, since tasks in a
deferred state are not running on a worker (although you could argue
that a triggerer is a special kind of worker, but I digress).
For the examples you've listed, deferred tasks should be part of the
equation since the task IS running, just not on a traditional worker.
Thanks for bringing this up! This has been bothering me for a while.
Constance
On Feb 22, 2026, at 4:18 AM, Jens Scheffler <[email protected]> wrote:
Hi There!
TLDR: In fix PR https://github.com/apache/airflow/pull/61769 we came to
the point that today in Airflow Core the "Deferred" state seems to be
counted inconsistently. I would propose to consistently count
"Deferred" into the counts of "Running".
Details:
* In Pools it has been possible for a longer time (since PR
  https://github.com/apache/airflow/pull/32709) to decide whether tasks
  in the deferred state are counted into pool allocation or not.
* Before that, deferred tasks were not counted in, which meant tasks
  sitting in the deferred state could potentially overwhelm backends,
  defeating the purpose of pools.
* Recently it was also seen that the other limits we usually define on
  Dags do not consistently include deferred tasks:
  o max_active_tasks - `The number of task instances allowed to run
    concurrently`
  o max_active_tis_per_dag - `When set, a task will be able to limit
    the concurrent runs across logical_dates.`
  o max_active_tis_per_dagrun - `When set, a task will be able to
    limit the concurrent task instances per Dag run.`
* This means that at the moment, defining a task as async/deferred
  escapes the limits.
Code references:
* Counting tasks in Scheduler on main:
  https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py#L190
* EXECUTION_STATES used for counting:
  https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/ti_deps/dependencies_states.py#L21
  o Here "Deferred" is missing!
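A minimal sketch of what the fix could look like, assuming the proposal to always count Deferred in. The set and helper functions below are simplified stand-ins for EXECUTION_STATES in airflow/ti_deps/dependencies_states.py and the scheduler's counting logic, not the real definitions.

```python
# Sketch of the proposed fix: include "deferred" in the set of states
# the scheduler counts against concurrency limits. Simplified
# stand-ins, not Airflow's actual code.

EXECUTION_STATES = frozenset({"queued", "running", "deferred"})  # "deferred" added


def concurrency_used(ti_states) -> int:
    """Task instances counted against max_active_tasks for a Dag."""
    return sum(1 for s in ti_states if s in EXECUTION_STATES)


def can_queue_more(ti_states, max_active_tasks: int = 16) -> bool:
    """Whether the scheduler may queue another task instance."""
    return concurrency_used(ti_states) < max_active_tasks
```

With "deferred" in the counted set, a Dag with many deferred task instances can no longer exceed its limit by going async.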
Alternatives that I see:
* Fix it in the Scheduler so that limits are applied consistently,
  always counting Deferred in.
* There might be a historic reason that Deferred is not counted in -
  then proper documentation would be needed - but I'd assume this is
  unlikely.
* There are different opinions - then the behavior might need to be
  configurable. (But personally I cannot see a reason for letting
  deferred tasks escape the defined limits.)
Jens
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]