Hi Daniel, Thank you for the review! I'm happy to keep having the discussion to make sure we can introduce the right way of implementing these solutions into Airflow.
My general impression from the community in the discussions so far led me to believe that deprecating the problematic feature would be welcomed at large, as long as we can educate folks about how they can achieve a similar outcome by using Deferrable Operators - which will be much less problematic for the Airflow core infrastructure. But as you say, if you think it is important to maintain the feature until we can find a suitable, systematic replacement for it, I'm happy to hold off on the implementations until we can make the necessary edits to the AIP. Here are my answers to your points: [1] Yes, that's correct. I believe that containing the SLA evaluation within the lifetime of a task as a duration-based sla will still have a purpose. It's technically implemented like an execution_timeout, but the goal of the SLA check is to execute a callback without killing the task: so that the accountable owner can be alerted, or a record of the SLA miss can be made without killing the task and throwing away the work that's been done. Measuring the SLA as a duration of a task is actually a feature that has been requested before, by multiple folks in previous discussions and PRs. [2] In the accepted version of the AIP, the DAG SLA is defined from the start_date of the DAG. I had initially considered for it to be evaluated from the scheduled start time (data_interval_end), but was stuck on how we could avoid false alarms when dag runs are cleared, which was discussed in the Google Doc that was circulated <https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit#heading=h.z45mqf5nt94> before the AIP was drafted. I've submitted a small change this week to introduce the dag_run clear_number attribute <https://github.com/apache/airflow/pull/34126> which would actually mitigate this issue. So if we feel that using the data_interval_end over the start_date for a SCHEDULED DagRunType is more appropriate, we can make that change on the open DAG SLA PR <https://github.com/apache/airflow/pull/33532>. [3] An example of using Deferrable Operators to create Custom Operators that monitor the SLA of a target task is discussed in the "Are there any downsides to this change?" section as a workaround that will allow users to measure SLA[3]: Task-level SLA measured from DAG-run scheduled start time. I agree that this isn't an optimal message to say that we are removing an existing feature. But perhaps the fact that users can implement this on their own in a scalable manner makes the feature deprecation easier to swallow... I know Pierre Jeambrun brought up the idea of implementing this way of detecting SLAs in a systematic way, so that instead of users attaching CustomOperators to their DAGs, Airflow will add Triggers that invoke sla callbacks at the end of the Trigger through a configuration parameter on the task that we want to check the SLA on. If you think making this approach of measuring SLAs on the Trigger, a first class idea on Airflow is a good idea - I'm happy to discuss this further and also learn more from you on how we could make this a reality. Some running ideas that come to my mind - new SLA feature requires a Triggerer process; we create SLA watching temporal triggers if sla is defined on the task; the sla_miss_callback uses 'context' argument, but the deferred operator that is executed upon the completion of the temporal trigger has a strict execution_timeout of 10 seconds (or other arbitrary short duration), to ensure that the worker slots are not taken up by slas for a significant duration; a separate pool for SLA triggers and workers (sla_pool?) so that it doesn't use the default pool? [4] I whole-heartedly agree with you on this - Deferrable Operators are my favorite, and it feels bad to me that we'd be introducing a feature that can't be extended to deferrables at all. And since the act of missing an SLA isn't a final state, I'm not sure exactly how we would be able to invoke an SLA callback on a Deferred Operator (or its corresponding Trigger) in the current workflow. On the idea of dags not being scheduled at a certain time - this is a pretty popular edge case that obviously falls under the subject of measuring SLAs, but I'm curious if it will be helpful to separate out the concern of maintaining a scheduler cluster with enough bandwidth, vs the concern of the DAG and the contained tasks running its job within its expected SLA, provided that the cluster is in a good state. There are existing reliability metrics that help us detect when the DAG runs themselves are being scheduled late, or if we are reaching the limit of worker parallelism. These are issues that will impact the delivery times of all of the DAGs and the Tasks, if not resolved immediately. So to me, as long as there are existing ways of detecting these more critical infrastructure issues (which there are), I am not too concerned that my SLA measuring might be impacted by a late scheduled DAG Run. On the idea of killing the running sla triggers, if the task succeeds - I think that's a great idea, it makes sure that the Trigger is removed when it is no longer needed. Please let me know if you think idea [3] is worth exploring further and materializing as a formal proposal. I'd be happy to find time to discuss it more in depth :) Sung On Tue, Sep 12, 2023 at 2:25 PM Daniel Standish <daniel.stand...@astronomer.io.invalid> wrote: > Some questions for you Sung. > > I tried looking to understand why we needed to remove behavior 3 discussed > in AIP: > > *[remove]* Task-level SLA measured from DAG-run scheduled start time > > > I'm just a little concerned that removing this would be a mistake because, > in my mind, part of the essence of what an SLA is, is an agreement about, > in essence, when your data should "be there". (or the "thing" done, etc) > > The behavior replacing it seems a lot like the existing execution_timeout > behavior: > > *[add]* Finally, I propose that we implement a new Task-level SLA feature, > > that is contained within the bounds of the task's lifetime, and is > measured > > within the task itself. > > > [1] Do I understand that correctly? If so, the question I would have is, > if we can already rely on execution timeout to tell us when an individual > task is taking too long, why do we need an SLA to tell us the same thing? > > [2] Can you also clarify how "dag slas" are defined? (behavior 1 in the > AIP). Is the duration to be applied relative to scheduled time or start > time? > > [3] I saw some discussion of using the triggerer component somehow in > this. Is that a thing? Can you share notes on implementation? I looked > on the AIP but did not find anything. > > [4] The fact that task sla cannot be used with deferrable operators seems > problematic to me. Users tend to expect (rightly or not) something like > "parity" between deferrable and non-deferrable tasks. SLAs would seem to > fall into a category where this seems like a reasonable expectation. > > Some discussion on solutions... > > I gather that the justification for removing behavior 3 is that it's too > complex / expensive. I feel like there's got to be a way. Thinking about > how we could implement behavior 3, there's two problems that sort of push > us into this perceived need to query everything all the time, in a manner > that is prohibitive. One is the problem that the dag may not get > scheduled. We need a way to raise if the dag is not scheduled X time after > it's "schedule *at*" date. For the moment let's call that the dag > scheduling timeout. What if, for each dag with such a scheduling timeout, > we launch a trigger that sleeps until that timeout and then when that runs > out it runs a single query to determine whether the dag run was scheduled > -- if not, raise. Then, for the task part of things.... well if the dag is > assumed to be reliably scheduled (and this is protected by the first part > here), then we can, at the time of dag run scheduling, launch triggers for > the individual tasks. And they would behave the same way. Run until > timeout and then query. > Perhaps better yet, perhaps when a task completes, it can delete its > associated "sla" triggers, so that its sla trigger would be cancelled, and > no query ever has to run. Same thing could be done for an overall dag SLA > -- at dag run completion could kill any trigger that is monitoring it's > execution duration. > > Anyway, I understand that the AIP is already accepted, but just hoping we > can have the discussion, in case we can get to a place where the existing > "expected" sla behavior can perhaps be preserved in a performant manner. > > Thanks > -- Sung Yun Cornell Tech '20 Master of Engineering in Computer Science