I’m -1 on making it a new separate process — we already have more of those than I feel comfortable with, and each new one we add seriously impacts the ability to usefully run “small” Airflow deployments.
I also think I disagree with the framing of your original message:

> * It would be more robust and resilient, and therefore be able to run the
> callbacks even in presence of certain kinds of issues like the scheduler
> being bogged-down

If this happens, almost every core functionality of Airflow itself breaks. At a more fundamental level: does this actually happen? How often does a scheduler actually get bogged down to the point that the scheduling loop can’t run frequently enough? I’m not talking about “it can’t schedule a task or a dag” but “it can’t actually run its core loop and it stops heartbeating”.

So I very strongly vote for Option 1, and if needed we should make the scheduler itself more resilient. The Airflow Scheduler _IS_ Airflow. Let’s do what we need to in order to make it more stable, rather than working around a problem of our own making while also making it operationally more complex to run.

> * It would avoid a potential slight increase in workload for the scheduler
> * The additional workload in the scheduler for option 1 would be checking
> to see if the earliest deadline has passed on a regular interval

[Citation needed] — this should be a _very_ quick indexed query (to the tune of `select from … where deadline date < now() limit n`, right?), so the impact on the scheduler should be almost unnoticeable. If you think this will cause a noticeable impact, then you need to show us some before-and-after data.

(Where the callbacks run I don’t mind; the triggerer could be a good solution there.)

-ash

> On 22 May 2025, at 14:19, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> How about Option 3) making it part of the triggerer?
>
> I think that goes in the direction we've been discussing in the past where
> we have "generic workload" that we can submit from any of the other
> components that will be executed in the triggerer.
>
> * that would not add too much complexity - no extra process to manage
> * the triggerer is an obligatory part of the installation now anyway
> * usually machines today have more processors, and the triggerer, with its event
> loop, does not seem to be too busy in terms of multi-processor usage (there
> are extra processes accessing the DB, but still not much I think). It could
> fork another process to run just the deadline checks.
> * re: multi-team - it's even easier, the triggerer is already going to be
> "per-team".
> * we could even rename the triggerer to "generic workload processor" (well,
> a shorter name, but to indicate that it could process any kind of workload -
> not only deferred triggers).
>
> Re: comments from Elad:
>
> 1) Naming-wise: I think we settled on the name already (looong discussion,
> naming is hard) and I think the scope of it is really just "deadlines" (we
> also wanted to distinguish it from SLA) - I like the name for this
> particular callback type, but yes - I agree it should be more generic, open
> to any future types of callbacks. If we go for the triggerer handling "generic
> workload" - that is IMHO "generic enough" to handle any future workloads.
>
> 2) I believe this is something that could be handled by the callback.
> The callback could have the option to submit a "cancel" request for
> the task it is called back for (via the task.sdk API) - but that should be up
> to whoever writes the callback.
>
> J.
>
> On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org> wrote:
>
>> I prefer option 2, but I have questions.
>> 1. Naming-wise, maybe we should prefer a more generic name, as I am not sure
>> it should be limited to deadlines? (Maybe it should be shared with
>> executing callbacks?)
>> 2. How do you plan to manage the queue of alerts? What happens if the
>> process is unhealthy while workers continue to execute tasks?
>>
>> On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
>> <ryan.hat...@astronomer.io.invalid> wrote:
>>
>>> +1 for option 2, primarily because of:
>>>
>>>> It would be more robust and resilient, and therefore be able to run the
>>>> callbacks *even in presence of certain kinds of issues like the scheduler
>>>> being bogged-down*
>>>
>>> On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
>>> <ramit...@amazon.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to discuss
>>>> implementation approaches for executing callbacks when Deadline Alerts are
>>>> triggered. As you may know, the old SLA feature has been removed, and we're
>>>> planning to introduce Deadline Alerts as a replacement in 3.1. When a
>>>> deadline is missed, we need a mechanism to execute callbacks (which could
>>>> be notifications or other actions).
>>>>
>>>> I’ve identified two main approaches:
>>>>
>>>> Option 1: Scheduler-based
>>>> In this approach, the scheduler would check on a regular interval to see
>>>> if the earliest deadline has passed and then queue the callback to run in
>>>> an executor (local or remote). The executor would be specified when
>>>> creating the deadline alert, and if none is specified, the default
>>>> executor would be used.
>>>>
>>>> Option 2: New DeadlineProcessor process
>>>> In this approach, there would be a new process, similar to the
>>>> triggerer/dag-processor and completely independent from the scheduler, to check
>>>> for deadlines on a regular interval and also run the callbacks without
>>>> queueing them in another executor.
>>>>
>>>> Multi-team considerations: For multi-team later this year, option 2 would
>>>> be relatively simple to implement. However, for option 1, the callbacks
>>>> would have to run on a remote executor since there would be no local
>>>> executor.
>>>>
>>>> I recommend going with option 2 because:
>>>>
>>>> * It would be more robust and resilient, and therefore be able to run
>>>> the callbacks even in presence of certain kinds of issues like the
>>>> scheduler being bogged-down
>>>> * It would also run the callbacks almost instantly instead of having
>>>> to wait for an executor (especially if there’s a long queue of tasks or a
>>>> cold-start delay)
>>>>   * This could be mitigated by implementing a priority system where
>>>> the deadline callbacks are prioritized over regular tasks, but this is a
>>>> non-trivial problem with my current understanding of Airflow’s architecture
>>>> * It would avoid a potential slight increase in workload for the
>>>> scheduler
>>>>   * The additional workload in the scheduler for option 1 would be
>>>> checking to see if the earliest deadline has passed on a regular interval
>>>>
>>>> However, it would introduce another process for admins to deploy and
>>>> manage, and also likely require more effort to implement, therefore taking
>>>> longer to complete.
>>>>
>>>> So, I’d like to hear your thoughts on these approaches, anything I may
>>>> have missed, and whether you agree/disagree with this direction. Thank you for
>>>> your input!
>>>>
>>>> Best,
>>>>
>>>> Ramit Kataria
>>>> SDE at AWS
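[Editor's note] The "quick indexed query" both sides of the thread refer to — the scheduler (or any other process) periodically asking "which deadlines have already passed?" — can be illustrated with a minimal sketch. The table name `deadline`, the columns, and the callback names below are hypothetical placeholders, not Airflow's actual schema; SQLite stands in for the metadata database.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: a "deadline" table with an index on deadline_time,
# so the periodic check is a cheap index range scan, not a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deadline ("
    "  id INTEGER PRIMARY KEY,"
    "  deadline_time TEXT NOT NULL,"   # ISO-8601 strings sort chronologically
    "  callback TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_deadline_time ON deadline (deadline_time)")

now = datetime(2025, 5, 22, 14, 0, 0)
conn.executemany(
    "INSERT INTO deadline VALUES (?, ?, ?)",
    [
        (1, (now - timedelta(minutes=5)).isoformat(), "notify_team"),  # missed
        (2, (now + timedelta(hours=1)).isoformat(), "page_oncall"),    # future
    ],
)

def missed_deadlines(conn, now, limit=10):
    """The per-loop check: fetch the earliest deadlines that have passed."""
    return conn.execute(
        "SELECT id, callback FROM deadline "
        "WHERE deadline_time < ? ORDER BY deadline_time LIMIT ?",
        (now.isoformat(), limit),
    ).fetchall()

print(missed_deadlines(conn, now))  # → [(1, 'notify_team')]
```

Whichever component runs this loop (scheduler, triggerer, or a new process), the check itself costs one bounded, index-backed query per interval; the contested question in the thread is only where the resulting callbacks execute.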