I’m -1 on making it a new separate process — we already have more of those than I feel comfortable with, and each new one we add seriously impacts the ability to usefully run “small” Airflow deployments.
I also think I disagree with the framing of your original message:

> * It would be more robust and resilient, and therefore be able to run the
> callbacks even in presence of certain kinds of issues like the scheduler
> being bogged-down

If this happens, almost every core functionality of Airflow itself breaks. At a more fundamental level: does this actually happen? How often does a scheduler actually get bogged down to the point that the scheduling loop can’t run frequently enough? I’m not talking about “it can’t schedule a task or a dag” but “it can’t actually run its core loop and it stops heartbeating”.

So I very strongly vote for Option 1, and if needed we should make the scheduler itself more resilient. The Airflow Scheduler _IS_ Airflow. Let’s do what we need to in order to make it more stable, rather than working around a problem of our own making while also making it operationally more complex to run.

> * It would avoid a potential slight increase in workload for the scheduler
> * The additional workload in the scheduler for option 1 would be checking
> to see if the earliest deadline has passed on a regular interval

[Citation needed] — this should be a _very_ quick indexed query (to the tune of `select from … where deadline date < now() limit n`, right?), so the impact on the scheduler should be almost unnoticeable. If you think this will cause a noticeable impact, then you need to show us some before-and-after data.

(Where the callbacks run I don’t mind; the triggerer could be a good solution there.)

-ash

> On 22 May 2025, at 14:19, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> How about Option 3) making it part of the triggerer?
>
> I think that goes in the direction we've been discussing in the past where
> we have "generic workload" that we can submit from any of the other
> components that will be executed in the triggerer.
>
> * that would not add too much complexity - no extra process to manage
> * the triggerer is an obligatory part of the installation now anyway
> * usually machines today have more processors, and the triggerer, with its event
> loop, does not seem to be too busy in terms of multi-processor usage (there
> are extra processes accessing the DB, but still not much I think). It could
> fork another process to run just the deadline checks.
> * re: multi-team - it's even easier, the triggerer is already going to be
> "per-team".
> * we could even rename the triggerer to "generic workload processor" (well,
> a shorter name, but to indicate that it could process any kind of workload -
> not only deferred triggers).
>
> Re: comments from Elad:
>
> 1) Naming-wise: I think we settled on the name already (looong discussion,
> naming is hard) and I think the scope of it is really just "deadlines" (we
> also wanted to distinguish it from SLA) - I like the name for this
> particular callback type, but yes - I agree it should be more generic, open
> to any future types of callbacks. If we go for the triggerer handling "generic
> workload" - that is IMHO "generic enough" to handle any future workloads.
>
> 2) I believe this is something that could be handled by the callback.
> The callback could have the option to submit a "cancel" request for
> the task it is called back for (via the task.sdk API) - but that should be up
> to whoever writes the callback.
>
> J.
>
> On Thu, May 22, 2025 at 10:03 AM Elad Kalif <elad...@apache.org> wrote:
>
>> I prefer option 2, but I have questions.
>> 1. Naming-wise, maybe we should prefer a more generic name, as I am not sure
>> it should be limited to deadlines? (Maybe it should be shared with
>> executing callbacks?)
>> 2. How do you plan to manage the queue of alerts? What happens if the
>> process is unhealthy while workers continue to execute tasks?
>>
>> On Thu, May 22, 2025 at 12:56 AM Ryan Hatter
>> <ryan.hat...@astronomer.io.invalid> wrote:
>>
>>> +1 for option 2, primarily because of:
>>>
>>>> It would be more robust and resilient, and therefore be able to run the
>>>> callbacks *even in presence of certain kinds of issues like the scheduler
>>>> being bogged-down*
>>>
>>> On Wed, May 21, 2025 at 5:09 PM Kataria, Ramit
>>> <ramit...@amazon.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I’m working with Dennis on Deadline Alerts (AIP-86). I'd like to discuss
>>>> implementation approaches for executing callbacks when Deadline Alerts are
>>>> triggered. As you may know, the old SLA feature has been removed, and we're
>>>> planning to introduce Deadline Alerts as a replacement in 3.1. When a
>>>> deadline is missed, we need a mechanism to execute callbacks (which could
>>>> be notifications or other actions).
>>>>
>>>> I’ve identified two main approaches:
>>>>
>>>> Option 1: Scheduler-based
>>>> In this approach, the scheduler would check on a regular interval to see
>>>> if the earliest deadline has passed and then queue the callback to run in
>>>> an executor (local or remote). The executor would be specified when
>>>> creating the deadline alert, and if none is specified, the default
>>>> executor would be used.
>>>>
>>>> Option 2: New DeadlineProcessor process
>>>> In this approach, there would be a new process, similar to the
>>>> triggerer/dag-processor and completely independent from the scheduler, to check
>>>> for deadlines on a regular interval and also run the callbacks without
>>>> queueing them in another executor.
>>>>
>>>> Multi-team considerations: For multi-team later this year, option 2 would
>>>> be relatively simple to implement. However, for option 1, the callbacks
>>>> would have to run on a remote executor since there would be no local
>>>> executor.
>>>>
>>>> I recommend going with option 2 because:
>>>>
>>>> * It would be more robust and resilient, and therefore be able to run
>>>> the callbacks even in presence of certain kinds of issues like the
>>>> scheduler being bogged-down
>>>> * It would also run the callbacks almost instantly instead of having
>>>> to wait for an executor (especially if there’s a long queue of tasks or a
>>>> cold-start delay)
>>>>   * This could be mitigated by implementing a priority system where
>>>> the deadline callbacks are prioritized over regular tasks, but this is a
>>>> non-trivial problem with my current understanding of Airflow’s architecture
>>>> * It would avoid a potential slight increase in workload for the
>>>> scheduler
>>>>   * The additional workload in the scheduler for option 1 would be
>>>> checking to see if the earliest deadline has passed on a regular interval
>>>>
>>>> However, it would introduce another process for admins to deploy and
>>>> manage, and also likely require more effort to implement, therefore taking
>>>> longer to complete.
>>>>
>>>> So, I’d like to hear your thoughts on these approaches, anything I may
>>>> have missed, and whether you agree/disagree with this direction. Thank you for
>>>> your input!
>>>>
>>>> Best,
>>>>
>>>> Ramit Kataria
>>>> SDE at AWS
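[Editor's note] The "quick indexed query" both sides of the thread refer to — the scheduler (or any other process) periodically asking "which deadlines have already passed?" — can be illustrated with a minimal sketch. The table name `deadline`, the columns, and the callback names below are hypothetical placeholders, not Airflow's actual schema; SQLite stands in for the metadata database.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: a "deadline" table with an index on deadline_time,
# so the periodic check is a cheap index range scan, not a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deadline ("
    "  id INTEGER PRIMARY KEY,"
    "  deadline_time TEXT NOT NULL,"   # ISO-8601 strings sort chronologically
    "  callback TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_deadline_time ON deadline (deadline_time)")

now = datetime(2025, 5, 22, 14, 0, 0)
conn.executemany(
    "INSERT INTO deadline VALUES (?, ?, ?)",
    [
        (1, (now - timedelta(minutes=5)).isoformat(), "notify_team"),  # missed
        (2, (now + timedelta(hours=1)).isoformat(), "page_oncall"),    # future
    ],
)

def missed_deadlines(conn, now, limit=10):
    """The per-loop check: fetch the earliest deadlines that have passed."""
    return conn.execute(
        "SELECT id, callback FROM deadline "
        "WHERE deadline_time < ? ORDER BY deadline_time LIMIT ?",
        (now.isoformat(), limit),
    ).fetchall()

print(missed_deadlines(conn, now))  # → [(1, 'notify_team')]
```

Whichever component runs this loop (scheduler, triggerer, or a new process), the check itself costs one bounded, index-backed query per interval; the contested question in the thread is only where the resulting callbacks execute.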