How about Option 3): making it part of the triggerer?

I think that goes in the direction we've been discussing in the past, where
we have a "generic workload" that we can submit from any of the other
components and that will be executed in the triggerer.

* that would not add too much complexity - no extra process to manage
* the triggerer is an obligatory part of the installation now anyway
* machines today usually have multiple processors, and the triggerer, with
its event loop, does not seem to be too busy in terms of multi-processor
usage (there are extra processes accessing the DB, but still not much, I
think). It could fork another process to run just the deadline checks.
* re multi-team: it's even easier, since the triggerer is already going to
be "per-team".
* we could even rename the triggerer to "generic workload processor" (well,
a shorter name, but one that indicates it could process any kind of
workload - not only deferred triggers).
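
To make the "fork another process" point a bit more concrete, here is a
minimal sketch of a deadline-check loop running in its own process next to
the parent's event loop. Everything in it is an assumption made for
illustration: the function names, the in-memory dict of deadlines (a real
implementation would read them from the DB) and the queue used to report
misses are hypothetical, not existing Airflow code.

```python
import multiprocessing as mp
import time


def find_missed_deadlines(deadlines, now):
    """Return the sorted names of deadlines whose due time is at or before `now`."""
    return sorted(name for name, due in deadlines.items() if due <= now)


def deadline_check_loop(deadlines, missed_queue, interval, stop_event):
    """Child-process body: poll for missed deadlines until asked to stop.

    Hypothetical sketch: each missed deadline is reported once on
    `missed_queue`; a real implementation would run the callback instead.
    """
    pending = dict(deadlines)
    while pending and not stop_event.is_set():
        for name in find_missed_deadlines(pending, time.time()):
            missed_queue.put(name)
            del pending[name]
        # Sleep for `interval` seconds, waking up early if a stop is requested.
        stop_event.wait(interval)


def start_deadline_checker(deadlines, interval=1.0):
    """Start the check loop in a separate process and return handles to it."""
    missed_queue = mp.Queue()
    stop_event = mp.Event()
    proc = mp.Process(
        target=deadline_check_loop,
        args=(deadlines, missed_queue, interval, stop_event),
        daemon=True,
    )
    proc.start()
    return proc, missed_queue, stop_event
```

The parent would call start_deadline_checker(...) once at startup (under the
usual `if __name__ == "__main__":` guard), keep running its own event loop,
and set the stop event on shutdown. The point is only that the checks use a
different core from the event loop, without a separately deployed component.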

Re: comments from Elad:

1) Naming-wise: I think we settled on the name already (looong discussion,
naming is hard) and I think its scope really is just "deadlines" (we
also wanted to distinguish it from SLA) - I like the name for this
particular callback type, but yes - I agree it should be more generic, open
to any future types of callbacks. If we go for the triggerer handling
"generic workload" - that is IMHO "generic enough" to handle any future
workloads.

2) I believe this is something that could be handled by the callback. The
callback could have the option to submit a "cancel" request for the task
it is called back for (via task.sdk API) - but that should be up to
whoever writes the callback.
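
A rough sketch of what "up to the one who writes the callback" could look
like. The client class, its request_cancel() method and the callback
signature below are all hypothetical placeholders; I am not describing an
existing task.sdk API here, only the shape of the decision living in
user-written callback code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTaskClient:
    """Hypothetical stand-in for whatever client the callback would be handed."""

    cancelled: list = field(default_factory=list)

    def request_cancel(self, task_id):
        # A real client would send a request to the API; this one just records it.
        self.cancelled.append(task_id)


def cancel_if_badly_late(task_id, overdue_seconds, client, threshold=300.0):
    """Example user callback: cancel the task only if it is badly overdue.

    Returns True if a cancel request was submitted, False otherwise.
    """
    if overdue_seconds >= threshold:
        client.request_cancel(task_id)
        return True
    return False
```

A different callback author could just as well only send a notification and
never cancel anything; the processor itself would not need to know.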

J.

On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org> wrote:

> I prefer option 2 but I have questions.
> 1. Naming wise maybe we should prefer a more generic name as I am not sure
> if it should be limited to deadlines? (maybe should be shared with
> executing callbacks?)
> 2. How do you plan to manage the queue of alerts? What happens if the
> process is unhealthy while workers continue to execute tasks?
>
> On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
> <ryan.hat...@astronomer.io.invalid> wrote:
>
> > +1 for option 2, primarily because of:
> >
> > > It would be more robust and resilient, and therefore be able to run the
> > > callbacks *even in presence of certain kinds of issues like the
> > > scheduler being bogged-down*
> >
> >
> > On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
> > <ramit...@amazon.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to discuss
> > > implementation approaches for executing callbacks when Deadline Alerts are
> > > triggered. As you may know, the old SLA feature has been removed, and we're
> > > planning to introduce Deadline Alerts as a replacement in 3.1. When a
> > > deadline is missed, we need a mechanism to execute callbacks (which could
> > > be notifications or other actions).
> > >
> > > I’ve identified two main approaches:
> > >
> > > Option 1: Scheduler-based
> > > In this approach, the scheduler would check on a regular interval to see
> > > if the earliest deadline has passed and then queue the callback to run in
> > > an executor (local or remote). The executor would be specified when
> > > creating the deadline alert and if there’s none specified, then the
> > > default executor would be used.
> > >
> > > Option 2: New DeadlineProcessor process
> > > In this approach, there would be a new process similar to
> > > triggerer/dag-processor completely independent from the scheduler to
> > > check for deadlines on a regular interval and also run the callbacks
> > > without queueing it in another executor.
> > >
> > > Multi-team considerations: For multi-team later this year, option 2 would
> > > be relatively simple to implement. However, for option 1, the callbacks
> > > would have to run on a remote executor since there would be no local
> > > executor.
> > >
> > > I recommend going with option 2 because:
> > >
> > >   *   It would be more robust and resilient, and therefore be able to run
> > > the callbacks even in presence of certain kinds of issues like the
> > > scheduler being bogged-down
> > >   *   It would also run the callbacks almost instantly instead of having
> > > to wait for an executor (especially if there’s a long queue of tasks or a
> > > cold-start delay)
> > >      *   This could be mitigated by implementing a priority system where
> > > the deadline callbacks are prioritized over regular tasks but this is a
> > > non-trivial problem with my current understanding of Airflow’s
> > > architecture
> > >   *   It would avoid a potential slight increase in workload for the
> > > scheduler
> > >      *   The additional workload in the scheduler for option 1 would be
> > > checking to see if the earliest deadline has passed on a regular interval
> > >
> > > However, it would introduce another process for admins to deploy and
> > > manage, and also likely require more effort to implement, therefore
> > > taking longer to complete.
> > >
> > > So, I’d like to hear your thoughts on these approaches, anything I may
> > > have missed and if you agree/disagree with this direction. Thank you for
> > > your input!
> > >
> > >
> > > Best,
> > >
> > > Ramit Kataria
> > > SDE at AWS
> > >
> >
>
