Apologies for the delayed response; I was OOO last week.

Thanks for writing this up, Dennis -- really helpful to have the rationale
documented in one place.

The reasoning behind the "two collection" approach, with callbacks having
higher priority than tasks, makes sense to me.

I think shipping with FIFO and no `max_executor_callbacks` for now is fine
too.
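Side note for anyone skimming the thread who hasn't read the PR: my rough
mental model of the slot-filling behavior Dennis describes is something like
the sketch below. To be clear, this is a hypothetical illustration of the
idea, not the actual executor code, and all names (`TwoCollectionExecutor`,
`fill_slots`, etc.) are made up:

```python
import heapq
from collections import deque


class TwoCollectionExecutor:
    """Hypothetical sketch of the two-collection scheduling described in the
    thread: callbacks held in FIFO order, tasks held in a priority heap, and
    open slots filled from the callback collection first."""

    def __init__(self):
        self.callbacks = deque()  # FIFO collection of callbacks
        self.tasks = []           # heap of (-priority_weight, seq, task)
        self._seq = 0             # insertion counter; breaks priority ties FIFO

    def add_callback(self, callback):
        self.callbacks.append(callback)

    def add_task(self, task, priority_weight):
        # Negate the weight so the highest priority_weight pops first.
        heapq.heappush(self.tasks, (-priority_weight, self._seq, task))
        self._seq += 1

    def fill_slots(self, open_slots):
        """Pick the work to dispatch: callbacks first, then tasks by weight."""
        batch = []
        while open_slots > 0 and self.callbacks:
            batch.append(self.callbacks.popleft())
            open_slots -= 1
        while open_slots > 0 and self.tasks:
            batch.append(heapq.heappop(self.tasks)[2])
            open_slots -= 1
        return batch
```

With both callbacks queued ahead of a high-weight task, `fill_slots` drains
the callback collection before touching the task heap, which matches the
"callbacks always win an open slot" behavior described below.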

Now, building on Vikram's question, one thing that might be worth clarifying
is: in a multi-queue executor setup (Celery with multiple workers/queues),
which *queue* do the callbacks get routed to when the executor actually
dispatches a callback to a worker? Is it always the default queue, or is
there some sort of affinity to the other queues? It might be worth
documenting even if the answer is a no-brainer.


Thanks & Regards,
Amogh Desai


On Wed, Mar 4, 2026 at 5:42 AM Ferruzzi, Dennis <[email protected]> wrote:

> We are going to go round and round on the "I think I know what you mean,
> but..." carousel 😄
>
> I suspect you are thinking in terms of Airflow Queues like the Celery
> Queue where you can direct tasks to a specific Celery worker.  I was using
> "callback queue" to refer to an internal data structure within the
> executor, not a user-facing Queue.  I could have easily called it a FIFO
> collection or list or whatever, and maybe I should have.
>
> Internally, the executor has two collections:  one stores the callbacks in
> FIFO order, and one stores the prioritized tasks.  When a slot opens up, it
> takes the top of the callback list if any are available, then fills the
> remaining slots from the list of tasks.  From the user's point of view it's
> one prioritized queue where all callbacks get top priority.
>
> I suspect there is confusion because I've somewhat overloaded the term
> Queue.  I'm not sure there is any configuration to update.
>
> - ferruzzi
> ________________________________
> From: Vikram Koka via dev <[email protected]>
> Sent: Tuesday, March 3, 2026 8:54 AM
> To: [email protected] <[email protected]>
> Cc: Vikram Koka <[email protected]>
> Subject: RE: [EXT] [AIP-86] Deadline Callback queuing and priority
>
> Dennis,
>
> Thank you for writing this up.
>
> I completely missed this concept of “callback queue” as a parallel FIFO to
> the task queue during the AIP discussions and in the AIP document itself.
> So, this definitely helps clarify that part for me.
>
> I understand the comment about honoring the parallelism config vs. not
> honoring the DAG level parallelism configurations.
>
> What I am not clear about is how this “callback queue” maps to the
> existing queues already configured within the Airflow deployments, when
> multiple queues are configured. I can guess, but would prefer to have a
> definitive answer based on your thinking and implementation.
> This then leads to the updated configuration needed to take advantage of
> this feature.
>
> Best regards,
> Vikram
>
> On Fri, Feb 27, 2026 at 3:27 PM Ferruzzi, Dennis <[email protected]>
> wrote:
>
> > I think a lot of this has been discussed in dev calls but maybe never
> > got documented anywhere for folks who aren't/weren't on the call.  I
> > haven't been very good about keeping the AIP up to date after it was
> > approved, or keeping community decisions and discussions easily
> > discoverable, and that's on me.  Now that folks can see the feature
> > as-implemented, Amogh raised some good questions that are worth
> > addressing here so the rationale is captured somewhere persistent.  I
> > also want to flag a couple of ideas that came out of that discussion as
> > potential follow-up improvements.
> >
> > The plan has been that we treat a deadline callback as "finishing old
> > work" and therefore all callbacks get priority over new tasks.  The core
> > design principle here is that deadline callbacks need to be timely — if a
> > deadline fires, the user expects to know about it promptly. A deadline
> > callback that gets stuck behind the same bottleneck it's trying to alert
> > you about isn't useful.
> >
> > To that end, what I implemented was effectively two queues:
> > task_instances get queued by priority_weight (as they "always" have),
> > and callbacks get queued in FIFO order.  When the scheduler looks for
> > new work it will always fill slots from the callback queue first.
> > Parallelism is honored, but pools and max_active_tasks_per_dag are
> > ignored.  The reason for that is twofold.  Let's say a user has a Dag
> > with a 10-minute deadline.  There are two cases where that deadline can
> > be triggered:
> >
> > Case 1) The Dag is still running at the deadline, so the callback
> > triggers - maybe a Slack message that the report is still being
> > generated and may be delayed today - while the Dag still runs.
> >
> > In this case, `pools` and `tasks_per_dag` may make sense as the Dag is
> > still active, but I'd argue that the deadline should "break through"
> > that pool/task limit and execute regardless.  For example, if the issue
> > is that the tasks stalled because the worker is hitting resource
> > constraints, then tacking the deadline callback on behind that
> > roadblock and waiting for them to finish before alerting the user that
> > there is a problem defeats the intent of the Deadline.  It is still
> > bound by the executor-level parallelism so it's not allowed to just run
> > rampant, but it isn't bound by the Dag-level constraints.
> >
> > Case 2) The Dag failed at 2 minutes and the callback is triggered at
> > the 2-minute point.  At this point there is no `pool` or `max_tasks` to
> > consider.  The task instances should have released their slots and
> > aren't being counted toward the active tasks.  Even if other Dags have
> > started up and are claiming pool slots, this callback isn't part of
> > that Dag and shouldn't be lumped in with its tasks.
> >
> > A `max_executor_callbacks` setting which parallels max_executor_tasks is
> > one idea that has come up.  It's not a bad idea; I guess if the whole
> > building is burning down, you don't need to know that each floor is
> > burning. It feels a bit against the intention of the callbacks getting
> > prioritized over tasks, but if it's a user-defined option then that's on
> > the user. It might be worth considering as a follow-up improvement if
> > anyone thinks it's something we really need.
> >
> > The other decision that seems to be contentious is that callbacks are
> > FIFO instead of implementing a `callback-priority-weight`.  FIFO seems
> > fine for a callback queue and how I envision the feature being used,
> > but maybe once the feature gets in the hands of users they'll find a
> > need for it.  We as a community have been saying for a while that there
> > are way too many user-facing knobs in the settings and this felt like a
> > logical place for us to be opinionated in the code.  With FIFO, the
> > feature's behavior is straightforward and predictable, and it's easier
> > to add the weight later if there is demand than it would be to remove
> > it later if we decide to prune the config options in a future version.
> >
> > For now, I think we're in a good place to ship with the current
> > behavior and we can iterate based on real-world usage.  I'll cut some
> > Issues to make sure the follow-up ideas from here and the PR are
> > tracked so they don't get lost.  If anyone has concerns about the
> > current behavior that should be addressed before launch, let me know.
> >
> >
> > - ferruzzi
> >
>
