Re: [DISCUSS] Deadline Alert Callbacks

Jarek Potiuk Thu, 22 May 2025 08:07:45 -0700

> Option 1 as originally proposed in this thread only does the "is a callback
> required" check in scheduler -- not running the callback in scheduler


Ah - then OK. I thought it also meant executing the callback. Just doing the
check in the scheduler is fine.
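
For illustration, here is a rough sketch of the check-only variant of Option 1, assuming pending deadlines sit in an in-memory min-heap keyed on deadline time (a real implementation would presumably query the metadata database instead). `PendingDeadline`, `pop_missed_deadlines`, `scheduler_tick` and `submit_callback` are hypothetical names, not Airflow APIs. The scheduler only decides that a callback is due and hands it off, so no DAG-author code runs in the scheduler process.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass(order=True)
class PendingDeadline:
    # Hypothetical record; only `deadline` takes part in ordering/comparison.
    deadline: datetime
    dag_id: str = field(compare=False)
    callback_name: str = field(compare=False)


def pop_missed_deadlines(heap: list, now: datetime) -> list:
    """Pop every deadline at or before `now`, earliest first.

    When nothing is due, only heap[0] is inspected, so the periodic
    "has the earliest deadline passed?" check is O(1).
    """
    missed = []
    while heap and heap[0].deadline <= now:
        missed.append(heapq.heappop(heap))
    return missed


def scheduler_tick(heap: list, now: datetime,
                   submit_callback: Callable[[PendingDeadline], None]) -> None:
    # The scheduler only decides *that* a callback is required; running it is
    # delegated to `submit_callback` (an executor, triggerer or other component),
    # so no DAG-author code executes inside the scheduler loop.
    for item in pop_missed_deadlines(heap, now):
        submit_callback(item)


# Usage: one deadline already missed, one still in the future.
now = datetime(2025, 5, 22, 12, 0, tzinfo=timezone.utc)
heap: list = []
heapq.heappush(heap, PendingDeadline(now + timedelta(minutes=5), "etl", "notify_slack"))
heapq.heappush(heap, PendingDeadline(now - timedelta(minutes=1), "report", "page_oncall"))

submitted: list = []
scheduler_tick(heap, now, submitted.append)
print([d.dag_id for d in submitted])  # ['report']
```

Because only the head of the heap is examined when nothing is due, the periodic check stays cheap, which matches the "potential slight increase in workload" Ramit describes for Option 1.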

On Thu, May 22, 2025 at 4:55 PM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Option 1 as originally proposed in this thread only does the "is a callback
> required" check in scheduler -- not running the callback in scheduler.
>
> On Thu, May 22, 2025 at 7:22 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > > So I very strongly vote for Option 1, and if needed make the scheduler
> > > itself more resilient. The Airflow Scheduler _IS_ airflow. Let’s do what
> > > we need to in order to make it more stable, rather than working around a
> > > problem of our own making, whilst also making it operationally more
> > > complex to run.
> >
> > Hey Ash - I forgot to add: Option 1 goes against our new security model.
> > It essentially means DAG author code executed in the scheduler. Ash - do
> > you think it is possible to avoid that? For DAG parsing, this resulted in
> > the mandatory dag-processor command, separated from the scheduler, so I am
> > not sure how we would solve the security issue here. Or maybe there is
> > another idea on how to solve it? That would be possible if deadline
> > callbacks were defined in plugins, but again, I think the idea was to let
> > DAG authors provide callbacks (which IMHO is synonymous with "we do not
> > run it in the scheduler").
> >
> > We could potentially run the callbacks in the Dag processor (which we
> > already did, BTW), but I am not sure this is what we want.
> >
> > J.
> >
> >
> > On Thu, May 22, 2025 at 3:40 PM Elad Kalif <elad...@apache.org> wrote:
> >
> > > My comment on the name is about the suggested component that runs the
> > > workload, not about the feature itself. I just suggest a more generic
> > > name so that, if the need arises, it would be easier to execute
> > > different kinds of workloads on it (like callbacks).
> > >
> > > As for reusing the Triggerer, I am not a fan of that. It serves a
> > > completely different purpose, and combining both cases may result in
> > > poor use of autoscaling. I don't think alerts/callbacks/other "misc"
> > > workloads should compete for the same resources as actual tasks.
> > >
> > > On Thu, May 22, 2025, 16:19 Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > How about Option 3: making it part of the triggerer.
> > > >
> > > > I think that goes in the direction we've discussed in the past, where
> > > > we have a "generic workload" that can be submitted from any of the
> > > > other components and is executed in the triggerer.
> > > >
> > > > * that would not add too much complexity: no extra process to manage
> > > > * the triggerer is an obligatory part of the installation now anyway
> > > > * machines today usually have multiple processors, and the triggerer,
> > > > with its event loop, does not seem to be too busy in terms of
> > > > multi-processor usage (there are extra processes accessing the DB, but
> > > > still not much, I think). It could fork another process to run just
> > > > the deadline checks.
> > > > * re: multi-team, it's even easier: the triggerer is already going to
> > > > be "per-team".
> > > > * we could even rename the triggerer to "generic workload processor"
> > > > (well, a shorter name, but one indicating that it could process any
> > > > kind of workload, not only deferred triggers).
> > > >
> > > > Re: comments from Elad:
> > > >
> > > > 1) Naming-wise: I think we have already settled on the name (looong
> > > > discussion, naming is hard), and I think its scope really is just
> > > > "deadlines" (we also wanted to distinguish it from SLA). I like the name
> > > > for this particular callback type, but yes, I agree it should be more
> > > > generic, open to any future types of callbacks. If we go with the
> > > > triggerer handling "generic workloads", that is IMHO "generic enough"
> > > > to handle any future workloads.
> > > >
> > > > 2) I believe this is something the callback itself could handle. The
> > > > callback could have the option to submit a "cancel" request for the
> > > > task it is called back for (via the task.sdk API), but that should be
> > > > up to whoever writes the callback.
> > > >
> > > > J.
> > > >
> > > >
> > > > On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org>
> > > > wrote:
> > > >
> > > > > I prefer option 2, but I have questions.
> > > > > 1. Naming-wise, maybe we should prefer a more generic name, as I am
> > > > > not sure it should be limited to deadlines (maybe it should be shared
> > > > > with executing callbacks?).
> > > > > 2. How do you plan to manage the queue of alerts? What happens if the
> > > > > process is unhealthy while workers continue to execute tasks?
> > > > >
> > > > > On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
> > > > > <ryan.hat...@astronomer.io.invalid> wrote:
> > > > >
> > > > > > +1 for option 2, primarily because of:
> > > > > >
> > > > > > > It would be more robust and resilient, and therefore be able to
> > > > > > > run the callbacks *even in the presence of certain kinds of issues
> > > > > > > like the scheduler being bogged down*
> > > > > >
> > > > > >
> > > > > > On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
> > > > > > <ramit...@amazon.com.invalid> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to
> > > > > > > discuss implementation approaches for executing callbacks when
> > > > > > > Deadline Alerts are triggered. As you may know, the old SLA feature
> > > > > > > has been removed, and we're planning to introduce Deadline Alerts
> > > > > > > as a replacement in 3.1. When a deadline is missed, we need a
> > > > > > > mechanism to execute callbacks (which could be notifications or
> > > > > > > other actions).
> > > > > > >
> > > > > > > I’ve identified two main approaches:
> > > > > > >
> > > > > > > Option 1: Scheduler-based
> > > > > > > In this approach, the scheduler would check at a regular interval
> > > > > > > whether the earliest deadline has passed and then queue the
> > > > > > > callback to run in an executor (local or remote). The executor
> > > > > > > would be specified when creating the deadline alert; if none is
> > > > > > > specified, the default executor would be used.
> > > > > > >
> > > > > > > Option 2: New DeadlineProcessor process
> > > > > > > In this approach, there would be a new process, similar to the
> > > > > > > triggerer/dag-processor and completely independent of the
> > > > > > > scheduler, that would check for deadlines at a regular interval
> > > > > > > and also run the callbacks itself, without queueing them in a
> > > > > > > separate executor.
> > > > > > >
> > > > > > > Multi-team considerations: For multi-team (planned for later this
> > > > > > > year), option 2 would be relatively simple to implement. For
> > > > > > > option 1, however, the callbacks would have to run on a remote
> > > > > > > executor, since there would be no local executor.
> > > > > > >
> > > > > > > I recommend going with option 2 because:
> > > > > > >
> > > > > > >   *   It would be more robust and resilient, and therefore be able
> > > > > > > to run the callbacks even in the presence of certain kinds of
> > > > > > > issues like the scheduler being bogged down
> > > > > > >   *   It would also run the callbacks almost instantly instead of
> > > > > > > having to wait for an executor (especially if there’s a long queue
> > > > > > > of tasks or a cold-start delay)
> > > > > > >      *   This could be mitigated by implementing a priority system
> > > > > > > in which deadline callbacks take precedence over regular tasks,
> > > > > > > but given my current understanding of Airflow’s architecture,
> > > > > > > this is a non-trivial problem
> > > > > > >   *   It would avoid a potential slight increase in workload for
> > > > > > > the scheduler
> > > > > > >      *   The additional scheduler workload for option 1 would be
> > > > > > > checking at a regular interval whether the earliest deadline has
> > > > > > > passed
> > > > > > >
> > > > > > > However, it would introduce another process for admins to deploy
> > > > > > > and manage, and would likely require more effort to implement,
> > > > > > > therefore taking longer to complete.
> > > > > > >
> > > > > > > So, I’d like to hear your thoughts on these approaches, anything
> > > > > > > I may have missed, and whether you agree or disagree with this
> > > > > > > direction. Thank you for your input!
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Ramit Kataria
> > > > > > > SDE at AWS
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
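
Options 2 and 3 differ from Option 1 mainly in that a long-lived process (a new DeadlineProcessor, or the triggerer acting as a "generic workload" processor) runs the callbacks itself instead of queueing them in an executor. A minimal asyncio sketch of such a loop follows, assuming workloads arrive on an in-memory `asyncio.Queue` (a real component would more likely pick them up from the metadata database); `Workload`, `run_workload` and `workload_processor` are hypothetical names, not Airflow or triggerer APIs. Each callback gets a timeout and its exceptions are captured, one possible way to keep a faulty DAG-author callback from stalling the others.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Workload:
    # Hypothetical "generic workload": a label plus a sync or async callable.
    name: str
    func: Callable[[], Any]


async def run_workload(w: Workload, timeout: float) -> tuple:
    """Run one workload, returning (name, outcome) instead of raising."""
    try:
        if inspect.iscoroutinefunction(w.func):
            await asyncio.wait_for(w.func(), timeout)
        else:
            # Sync callbacks go to a worker thread so they don't block the
            # event loop. On timeout we stop waiting; the thread itself is
            # not killed and may keep running in the background.
            await asyncio.wait_for(asyncio.to_thread(w.func), timeout)
        return (w.name, "success")
    except asyncio.TimeoutError:
        return (w.name, "timeout")
    except Exception:
        # A buggy callback is recorded rather than allowed to crash the loop.
        return (w.name, "failed")


async def workload_processor(queue: asyncio.Queue, results: list,
                             timeout: float = 5.0) -> None:
    # Long-running consumer; several can share one queue for concurrency.
    while True:
        w = await queue.get()
        try:
            results.append(await run_workload(w, timeout))
        finally:
            queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(workload_processor(queue, results))
               for _ in range(2)]

    async def notify() -> None:
        await asyncio.sleep(0.01)  # stands in for e.g. sending a notification

    def broken() -> None:
        raise RuntimeError("bug in callback")

    queue.put_nowait(Workload("notify", notify))
    queue.put_nowait(Workload("broken", broken))
    await queue.join()  # wait until every queued workload has been processed

    for t in workers:
        t.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(main())
print(sorted(results))  # [('broken', 'failed'), ('notify', 'success')]
```

The sketch leaves open Elad's question about what happens when the processing component itself is unhealthy; answering that would need health checks or persisted workload state beyond this example.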
