+1 for option 2, primarily because of:

> It would be more robust and resilient, and therefore be able to run the
> callbacks *even in presence of certain kinds of issues like the scheduler
> being bogged-down*
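
For illustration, here is a rough Python sketch of the shape option 2 describes: a standalone loop that checks the earliest deadline on an interval and runs the callback in-process, with no executor queue in between. This is not Airflow code. `Deadline`, `DeadlineProcessor`, `run_once` and the injectable `clock` are hypothetical names made up for the example, and a real implementation would read deadlines from the metadata database rather than an in-memory heap.

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Deadline:
    """Hypothetical deadline record; ordered by due time only."""
    due_at: float
    callback: Callable[[], None] = field(compare=False)


class DeadlineProcessor:
    """Hypothetical stand-in for the proposed separate process."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._heap: List[Deadline] = []  # min-heap, earliest deadline first
        self._clock = clock

    def add(self, deadline: Deadline) -> None:
        heapq.heappush(self._heap, deadline)

    def run_once(self) -> int:
        """Run callbacks for every deadline that has passed; return the count."""
        now = self._clock()
        fired = 0
        while self._heap and self._heap[0].due_at <= now:
            deadline = heapq.heappop(self._heap)
            try:
                # Runs inline: no hand-off to an executor queue.
                deadline.callback()
            except Exception as exc:
                # One failing callback should not stop the others.
                print(f"deadline callback failed: {exc!r}")
            fired += 1
        return fired

    def run_forever(self, interval: float = 5.0) -> None:
        """Poll on a fixed interval, independent of any scheduler loop."""
        while True:
            self.run_once()
            time.sleep(interval)
```

Because the loop owns both the check and the callback execution, a slow scheduler or a long executor queue does not delay it, which is the robustness point quoted above. The cost is the one the original mail already names: one more process to deploy and manage.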
On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit <ramit...@amazon.com.invalid> wrote:

> Hi all,
>
> I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to discuss
> implementation approaches for executing callbacks when Deadline Alerts are
> triggered. As you may know, the old SLA feature has been removed, and we're
> planning to introduce Deadline Alerts as a replacement in 3.1. When a
> deadline is missed, we need a mechanism to execute callbacks (which could
> be notifications or other actions).
>
> I’ve identified two main approaches:
>
> Option 1: Scheduler-based
> In this approach, the scheduler would check on a regular interval to see
> if the earliest deadline has passed and then queue the callback to run in
> an executor (local or remote). The executor would be specified when
> creating the deadline alert and if there’s none specified, then the default
> executor would be used.
>
> Option 2: New DeadlineProcessor process
> In this approach, there would be a new process similar to
> triggerer/dag-processor completely independent from the scheduler to check
> for deadlines on a regular interval and also run the callbacks without
> queueing it in another executor.
>
> Multi-team considerations: For multi-team later this year, option 2 would
> be relatively simple to implement. However, for option 1, the callbacks
> would have to run on a remote executor since there would be no local
> executor.
>
> I recommend going with option 2 because:
>
>   * It would be more robust and resilient, and therefore be able to run
>     the callbacks even in presence of certain kinds of issues like the
>     scheduler being bogged-down
>   * It would also run the callbacks almost instantly instead of having
>     to wait for an executor (especially if there’s a long queue of tasks
>     or a cold-start delay)
>       * This could be mitigated by implementing a priority system where
>         the deadline callbacks are prioritized over regular tasks but this
>         is a non-trivial problem with my current understanding of
>         Airflow’s architecture
>   * It would avoid a potential slight increase in workload for the
>     scheduler
>       * The additional workload in the scheduler for option 1 would be
>         checking to see if the earliest deadline has passed on a regular
>         interval
>
> However, it would introduce another process for admins to deploy and
> manage, and also likely require more effort to implement, therefore taking
> longer to complete.
>
> So, I’d like to hear your thoughts on these approaches, anything I may
> have missed and if you agree/disagree with this direction. Thank you for
> your input!
>
>
> Best,
>
> Ramit Kataria
> SDE at AWS