> Option 1 as originally proposed in this thread only does the "is a callback required" check in scheduler -- not running the callback in scheduler
Ah - then OK. I thought it's also callback execution. Just checking is
fine in scheduler.

On Thu, May 22, 2025 at 4:55 PM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Option 1 as originally proposed in this thread only does the "is a
> callback required" check in scheduler -- not running the callback in
> scheduler.
>
> On Thu, May 22, 2025 at 7:22 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > > So I very strongly vote for Option 1, and if needed make the
> > > scheduler itself more resilient. The Airflow Scheduler _IS_ airflow.
> > > Let’s do what we need to in order to make it more stable, rather
> > > than working around a problem of our own making, whilst also making
> > > it operationally more complex to run.
> >
> > Hey Ash - I forgot to add. Option 1 is against our new security model.
> > This is essentially DAG author code executed in the scheduler. Ash -
> > do you think it is possible to avoid that? For DAG parsing it resulted
> > in a mandatory dag-processor command separated from the scheduler, so
> > I am not sure how we would solve the security issue here? Or maybe
> > there is another idea on how to solve it? That would be possible if we
> > had deadline callbacks defined in the plugins, but again - I think the
> > idea was to be able to provide callbacks by DAG authors (which IMHO is
> > synonymous with "we do not run it in scheduler").
> >
> > We could potentially run the callbacks in the Dag processor (which we
> > already did BTW), but I am not sure if this is what we want.
> >
> > J.
> >
> > On Thu, May 22, 2025 at 3:40 PM Elad Kalif <elad...@apache.org> wrote:
> >
> > > My comment on the name is for the suggested component that runs the
> > > workload. It's not about the feature itself. I just suggest a more
> > > generic name so if the need comes it would be easier to execute
> > > different kinds of workloads on it (like callbacks).
> > >
> > > As for reusing the Triggerer, I am not a fan of that.
> > > It serves a completely different purpose and combining both cases
> > > may result in poor usage of auto scaling. I don't think
> > > alerts/callbacks/other "misc" should compete on the same resources
> > > as actual tasks.
> > >
> > > On Thu, May 22, 2025 at 16:19, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > How about Option 3) making it part of triggerer.
> > > >
> > > > I think that goes in the direction we've been discussing in the
> > > > past where we have "generic workload" that we can submit from any
> > > > of the other components that will be executed in triggerer.
> > > >
> > > > * that would not add too much complexity - no extra process to
> > > >   manage
> > > > * triggerer is obligatory part of installation now anyway
> > > > * usually machines today have more processors and triggerer, with
> > > >   its event loop, does not seem to be too busy in terms of
> > > >   multi-processor usage (there are extra processes accessing the
> > > >   DB but still not much I think). It could fork another process to
> > > >   run just deadline checks.
> > > > * re - multi-team it's even easier, triggerer is already going to
> > > >   be "per-team".
> > > > * we could even rename triggerer to "generic workload processor"
> > > >   (well shorter name, but to indicate that it could process any
> > > >   kind of workloads - not only deferred triggers).
> > > >
> > > > Re: comments from Elad:
> > > >
> > > > 1) Naming wise: I think we settled on the name already (looong
> > > > discussion, naming is hard) and I think the scope of it is just
> > > > really "deadlines" (we also wanted to distinguish it from SLA) - I
> > > > like the name for this particular callback type, but yes - I agree
> > > > it should be more generic, open for any future types of callbacks.
> > > > If we go for triggerer handling "generic workload" - that is IMHO
> > > > "generic enough" to handle any future workloads
> > > >
> > > > 2) I believe this is something that could be handled by the
> > > > callback. Callback could have the option to be able to submit
> > > > "cancel" request for the task it is called back for (via task.sdk
> > > > API) - but that should be up to the one who writes the callback.
> > > >
> > > > J.
> > > >
> > > > On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org>
> > > > wrote:
> > > >
> > > > > I prefer option 2 but I have questions.
> > > > > 1. Naming wise maybe we should prefer a more generic name as I
> > > > > am not sure if it should be limited to deadlines? (maybe should
> > > > > be shared with executing callbacks?)
> > > > > 2. How do you plan to manage the queue of alerts? What happens
> > > > > if the process is unhealthy while workers continue to execute
> > > > > tasks?
> > > > >
> > > > > On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
> > > > > <ryan.hat...@astronomer.io.invalid> wrote:
> > > > >
> > > > > > +1 for option 2, primarily because of:
> > > > > >
> > > > > > > It would be more robust and resilient, and therefore be able
> > > > > > > to run the callbacks *even in presence of certain kinds of
> > > > > > > issues like the scheduler being bogged-down*
> > > > > >
> > > > > > On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
> > > > > > <ramit...@amazon.com.invalid> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I’m working with Dennis on Deadline Alerts (AIP-86). I'd
> > > > > > > like to discuss implementation approaches for executing
> > > > > > > callbacks when Deadline Alerts are triggered.
> > > > > > > As you may know, the old SLA feature has been removed, and
> > > > > > > we're planning to introduce Deadline Alerts as a replacement
> > > > > > > in 3.1. When a deadline is missed, we need a mechanism to
> > > > > > > execute callbacks (which could be notifications or other
> > > > > > > actions).
> > > > > > >
> > > > > > > I’ve identified two main approaches:
> > > > > > >
> > > > > > > Option 1: Scheduler-based
> > > > > > > In this approach, the scheduler would check on a regular
> > > > > > > interval to see if the earliest deadline has passed and then
> > > > > > > queue the callback to run in an executor (local or remote).
> > > > > > > The executor would be specified when creating the deadline
> > > > > > > alert and if there’s none specified, then the default
> > > > > > > executor would be used.
> > > > > > >
> > > > > > > Option 2: New DeadlineProcessor process
> > > > > > > In this approach, there would be a new process similar to
> > > > > > > triggerer/dag-processor completely independent from the
> > > > > > > scheduler to check for deadlines on a regular interval and
> > > > > > > also run the callbacks without queueing it in another
> > > > > > > executor.
> > > > > > >
> > > > > > > Multi-team considerations: For multi-team later this year,
> > > > > > > option 2 would be relatively simple to implement. However,
> > > > > > > for option 1, the callbacks would have to run on a remote
> > > > > > > executor since there would be no local executor.
> > > > > > >
> > > > > > > I recommend going with option 2 because:
> > > > > > >
> > > > > > > * It would be more robust and resilient, and therefore be
> > > > > > >   able to run the callbacks even in presence of certain
> > > > > > >   kinds of issues like the scheduler being bogged-down
> > > > > > > * It would also run the callbacks almost instantly instead
> > > > > > >   of having to wait for an executor (especially if there’s a
> > > > > > >   long queue of tasks or a cold-start delay)
> > > > > > >   * This could be mitigated by implementing a priority
> > > > > > >     system where the deadline callbacks are prioritized over
> > > > > > >     regular tasks but this is a non-trivial problem with my
> > > > > > >     current understanding of Airflow’s architecture
> > > > > > > * It would avoid a potential slight increase in workload for
> > > > > > >   the scheduler
> > > > > > >   * The additional workload in the scheduler for option 1
> > > > > > >     would be checking to see if the earliest deadline has
> > > > > > >     passed on a regular interval
> > > > > > >
> > > > > > > However, it would introduce another process for admins to
> > > > > > > deploy and manage, and also likely require more effort to
> > > > > > > implement, therefore taking longer to complete.
> > > > > > >
> > > > > > > So, I’d like to hear your thoughts on these approaches,
> > > > > > > anything I may have missed and if you agree/disagree with
> > > > > > > this direction. Thank you for your input!
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Ramit Kataria
> > > > > > > SDE at AWS
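
For readers skimming the thread, the difference between the two options can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from AIP-86 or from Airflow: the `Deadline` record and the functions `pop_missed`, `scheduler_style_check` and `processor_style_check` are hypothetical names, and a plain Python list stands in for whatever queue an executor would really use. Both variants share the same "has the earliest deadline passed?" check; they differ only in whether the callback is queued for another component or run in the checking process itself.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass(order=True)
class Deadline:
    # Hypothetical record for illustration; not an Airflow model.
    due: datetime
    name: str = field(compare=False)
    callback: Callable[[], str] = field(compare=False)


def pop_missed(heap: List[Deadline], now: datetime) -> List[Deadline]:
    """Pop every deadline whose due time is at or before `now`.

    The heap is ordered by `due`, so only the earliest entry is inspected
    on each step, matching the "earliest deadline" check in the proposal.
    """
    missed = []
    while heap and heap[0].due <= now:
        missed.append(heapq.heappop(heap))
    return missed


def scheduler_style_check(
    heap: List[Deadline], now: datetime, executor_queue: List[Deadline]
) -> None:
    """Option 1 shape: detect missed deadlines and only queue the callbacks."""
    for deadline in pop_missed(heap, now):
        executor_queue.append(deadline)  # the callback runs later, elsewhere


def processor_style_check(heap: List[Deadline], now: datetime) -> List[str]:
    """Option 2 shape: detect missed deadlines and run callbacks in-process."""
    return [deadline.callback() for deadline in pop_missed(heap, now)]


if __name__ == "__main__":
    t0 = datetime(2025, 5, 22, 12, 0)
    deadlines: List[Deadline] = []
    heapq.heappush(deadlines, Deadline(t0 - timedelta(minutes=5), "a", lambda: "a fired"))
    heapq.heappush(deadlines, Deadline(t0 + timedelta(minutes=5), "b", lambda: "b fired"))

    queued: List[Deadline] = []
    scheduler_style_check(deadlines, t0, queued)
    print([d.name for d in queued])
    print(processor_style_check(deadlines, t0 + timedelta(minutes=10)))
```

In the first shape no DAG-author code runs in the checking process, which is the point Daniel and Jarek settle on at the top of the thread; the second shape trades that separation for lower latency, as Ramit's message argues.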