Apologies for the delayed response; I was OOO last week. Thanks for writing this up, Dennis -- really helpful to have the rationale documented in one place.
The reasoning behind the "two collection" approach, with callbacks having higher priority than tasks, makes sense to me. I think shipping with FIFO and no `max_executor_callbacks` for now is fine too.

Now, building on Vikram's question, one thing that might be worth clarifying: in a multi-queue executor setup (Celery with multiple workers/queues), which *queue* do the callbacks get routed to when the executor actually dispatches them to a worker? Is it always the default queue, or is there some sort of affinity to the other queues? It might be worth documenting even if the answer is a no-brainer.

Thanks & Regards,
Amogh Desai

On Wed, Mar 4, 2026 at 5:42 AM Ferruzzi, Dennis <[email protected]> wrote:
> We are going to go round and round on the "I think I know what you mean,
> but..." carousel 😄
>
> I suspect you are thinking in terms of Airflow Queues, like the Celery
> Queue, where you can direct tasks to a specific Celery worker. I was using
> "callback queue" to refer to an internal data structure within the
> executor, not a user-facing Queue. I could just as easily have called it a
> FIFO collection or a list, and maybe I should have.
>
> Internally, the executor has two collections: one stores the callbacks in
> FIFO order, and one stores the prioritized tasks. When a slot opens up, it
> takes the top of the callback list if any are available, then fills the
> remaining slots from the list of tasks. From the user's point of view it's
> one prioritized queue where all callbacks get top priority.
>
> I suspect there is confusion because I've somewhat overloaded the term
> Queue. I'm not sure there is any configuration to update.
>
> - ferruzzi
> ________________________________
> From: Vikram Koka via dev <[email protected]>
> Sent: Tuesday, March 3, 2026 8:54 AM
> To: [email protected] <[email protected]>
> Cc: Vikram Koka <[email protected]>
> Subject: RE: [EXT] [AIP-86] Deadline Callback queuing and priority
>
> Dennis,
>
> Thank you for writing this up.
>
> I completely missed this concept of "callback queue" as a parallel FIFO to
> the task queue during the AIP discussions and in the AIP document itself.
> So, this definitely helps clarify that part for me.
>
> I understand the comment about honoring the parallelism config vs. not
> honoring the DAG-level parallelism configurations.
>
> What I am not clear about is how this "callback queue" maps to the
> existing queues already configured within Airflow deployments when
> multiple queues are configured. I can guess, but would prefer a definitive
> answer based on your thinking and implementation. This then leads to the
> updated configuration needed to take advantage of this feature.
>
> Best regards,
> Vikram
>
> On Fri, Feb 27, 2026 at 3:27 PM Ferruzzi, Dennis <[email protected]>
> wrote:
>
> > I think a lot of this has been discussed in dev calls but maybe never
> > got documented anywhere for folks who aren't/weren't on the call. I
> > haven't been very good about keeping the AIP up to date after it was
> > approved, or keeping community decisions and discussions easily
> > discoverable, and that's on me.
> > Now that folks can see the feature as-implemented, Amogh raised some
> > good questions that are worth addressing here so the rationale is
> > captured somewhere persistent. I also want to flag a couple of ideas
> > that came out of that discussion as potential follow-up improvements.
> >
> > The plan has been that we treat a deadline callback as "finishing old
> > work", and therefore all callbacks get priority over new tasks. The
> > core design principle here is that deadline callbacks need to be timely
> > -- if a deadline fires, the user expects to know about it promptly. A
> > deadline callback that gets stuck behind the same bottleneck it's
> > trying to alert you about isn't useful.
> >
> > To that end, what I implemented is effectively two queues:
> > task_instances get queued by priority_weight (as they "always" have),
> > and callbacks get queued in FIFO order. When the scheduler looks for
> > new work, it will always fill slots from the callback queue first.
> > Parallelism is honored, but pools and max_active_tasks_per_dag are
> > ignored. The reason for that is twofold. Let's say a user has a Dag
> > with a 10-minute deadline. There are two cases where that deadline can
> > be triggered:
> >
> > Case 1) The Dag is still running at the deadline, and the callback
> > triggers while the Dag still runs -- maybe a Slack message that the
> > report is still being generated and may be delayed today.
> >
> > In this case, `pools` and `tasks_per_dag` may make sense since the Dag
> > is still active, but I'd argue that the deadline should "break through"
> > that pool/task limit and execute regardless. For example, if the tasks
> > stalled because the worker is hitting resource constraints, then
> > tacking the deadline callback on behind that roadblock and waiting for
> > them to finish before alerting the user that there is a problem defeats
> > the intent of the Deadline.
> > It is still bound by the executor-level parallelism, so it's not
> > allowed to just run rampant, but it isn't bound by the Dag-level
> > constraints.
> >
> > Case 2) The Dag failed at 2 minutes, and the callback is triggered at
> > the 2-minute point. At this point there is no `pool` or `max_tasks` to
> > consider. The task instances should have released their slots and
> > aren't being counted toward the active tasks. Even if other Dags have
> > started up and are claiming pool slots, this callback isn't part of
> > those Dags and shouldn't be lumped in with their tasks.
> >
> > A `max_executor_callbacks` setting that parallels max_executor_tasks is
> > one idea that has come up. It's not a bad idea; I guess if the whole
> > building is burning down, you don't need to know that each floor is
> > burning. It feels a bit against the intention of callbacks getting
> > prioritized over tasks, but if it's a user-defined option then that's
> > on the user. It might be worth considering as a follow-up improvement
> > if anyone thinks it's something we really need.
> >
> > The other decision that seems to be contentious is that callbacks are
> > FIFO instead of implementing a `callback-priority-weight`. FIFO seems
> > fine for a callback queue and for how I envision the feature being
> > used, but maybe once the feature gets into the hands of users they'll
> > find a need for it. We as a community have been saying for a while that
> > there are way too many user-facing knobs in the settings, and this felt
> > like a logical place for us to be opinionated in the code. With FIFO,
> > the feature's behavior is straightforward and predictable, and it's
> > easier to add the weight later if there is demand than it would be to
> > remove it later if we decide to prune the config options in a future
> > version.
> >
> > For now, I think we're in a good place to ship with the current
> > behavior and we can iterate based on real-world usage.
> > I'll cut some Issues to make sure the follow-up ideas from here and the
> > PR are tracked so they don't get lost. If anyone has concerns about the
> > current behavior that should be addressed before launch, let me know.
> >
> > - ferruzzi
>
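For anyone skimming the thread, the dispatch behavior Dennis describes can be sketched in a few lines of Python. This is a toy model with hypothetical names (`SlotFiller`, `fill_slots`, etc.), not the actual Airflow executor code: callbacks drain in FIFO order first, remaining open slots go to tasks by descending priority_weight, and only the executor-level parallelism is enforced (no pools, no per-Dag limits).

```python
import heapq
from collections import deque


class SlotFiller:
    """Toy model of the two-collection approach described above.

    Callbacks are FIFO and always drain before prioritized tasks;
    the only bound is the executor-level parallelism.
    """

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = 0          # slots currently occupied
        self.callbacks = deque()  # FIFO: deadline callbacks
        self.tasks = []           # min-heap keyed on -priority_weight

    def add_callback(self, callback):
        self.callbacks.append(callback)

    def add_task(self, priority_weight, task):
        # Negate so a higher priority_weight pops first.
        heapq.heappush(self.tasks, (-priority_weight, task))

    def fill_slots(self):
        """Return the work items dispatched in one scheduling pass."""
        dispatched = []
        open_slots = self.parallelism - self.running
        # Callbacks first, in arrival order...
        while open_slots > 0 and self.callbacks:
            dispatched.append(self.callbacks.popleft())
            open_slots -= 1
        # ...then tasks by descending priority_weight.
        while open_slots > 0 and self.tasks:
            dispatched.append(heapq.heappop(self.tasks)[1])
            open_slots -= 1
        self.running += len(dispatched)
        return dispatched
```

With parallelism=3, two queued callbacks and two tasks, one pass dispatches both callbacks plus the highest-weight task; the lower-weight task waits for a freed slot. A `max_executor_callbacks` cap, if it were ever added, would amount to one extra bound on the first loop.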
