Re: [DISCUSS] Mechanism of SLA

Sung Yun Tue, 12 Sep 2023 12:48:29 -0700

Hi Daniel,

Thank you for the review! I'm happy to keep having the discussion to make
sure we can introduce the right way of implementing these solutions into
Airflow.

My general impression from the community in the discussions so far led me
to believe that deprecating the problematic feature would be welcomed at
large, as long as we can educate folks about how they can achieve a similar
outcome by using Deferrable Operators - which will be much less problematic
for the Airflow core infrastructure.
But as you say, if you think it is important to maintain the feature until
we can find a suitable, systematic replacement for it, I'm happy to hold
off on the implementations until we can make the necessary edits to the AIP.

Here are my answers to your points:

[1] Yes, that's correct. I believe that containing the SLA evaluation
within the lifetime of a task as a duration-based sla will still have a
purpose. It's technically implemented like an execution_timeout, but the
goal of the SLA check is to execute a callback without killing the task: so
that the accountable owner can be alerted, or a record of the SLA miss can
be made without killing the task and throwing away the work that's been
done. Measuring the SLA as a duration of a task is actually a feature that
has been requested before, by multiple folks in previous discussions and
PRs.

[2] In the accepted version of the AIP, the DAG SLA is defined from the
start_date of the DAG. I had initially considered for it to be evaluated
from the scheduled start time (data_interval_end), but was stuck on how we
could avoid false alarms when dag runs are cleared, which was discussed in
the Google Doc that was circulated
<https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit#heading=h.z45mqf5nt94>
before the AIP was drafted. I've submitted a small change this week to
introduce
the dag_run clear_number attribute
<https://github.com/apache/airflow/pull/34126> which would actually
mitigate this issue. So if we feel that using the data_interval_end over
the start_date for a SCHEDULED DagRunType is more appropriate, we can make
that change on the open DAG SLA PR
<https://github.com/apache/airflow/pull/33532>.

[3] An example of using Deferrable Operators to create Custom Operators
that monitor the SLA of a target task is discussed in the "Are there any
downsides to this change?" section as a workaround that will allow users to
measure SLA[3]: Task-level SLA measured from DAG-run scheduled start time.
I agree that this isn't an optimal message to say that we are removing an
existing feature. But perhaps the fact that users can implement this on
their own in a scalable manner makes the feature deprecation easier to
swallow... I know Pierre Jeambrun brought up the idea of implementing this
way of detecting SLAs in a systematic way, so that instead of users
attaching CustomOperators to their DAGs, Airflow will add Triggers that
invoke sla callbacks at the end of the Trigger through a configuration
parameter on the task that we want to check the SLA on. If you think making
this approach of measuring SLAs on the Trigger, a first class idea on
Airflow is a good idea - I'm happy to discuss this further and also learn
more from you on how we could make this a reality. Some running ideas that
come to my mind - new SLA feature requires a Triggerer process; we create
SLA watching temporal triggers if sla is defined on the task; the
sla_miss_callback uses 'context' argument, but the deferred operator that
is executed upon the completion of the temporal trigger has a strict
execution_timeout of 10 seconds (or other arbitrary short duration), to
ensure that the worker slots are not taken up by slas for a significant
duration; a separate pool for SLA triggers and workers (sla_pool?) so that
it doesn't use the default pool?

[4] I whole-heartedly agree with you on this - Deferrable Operators are my
favorite, and it feels bad to me that we'd be introducing a feature that
can't be extended to deferrables at all. And since the act of missing an
SLA isn't a final state, I'm not sure exactly how we would be able to
invoke an SLA callback on a Deferred Operator (or its corresponding
Trigger) in the current workflow.

On the idea of dags not being scheduled at a certain time - this is a
pretty popular edge case that obviously falls under the subject of
measuring SLAs, but I'm curious if it will be helpful to separate out the
concern of maintaining a scheduler cluster with enough bandwidth, vs the
concern of the DAG and the contained tasks running its job within its
expected SLA, provided that the cluster is in a good state. There are
existing reliability metrics that help us detect when the DAG runs
themselves are being scheduled late, or if we are reaching the limit of
worker parallelism. These are issues that will impact the delivery times of
all of the DAGs and the Tasks, if not resolved immediately. So to me, as
long as there are existing ways of detecting these more critical
infrastructure issues (which there are), I am not too concerned that my SLA
measuring might be impacted by a late scheduled DAG Run.

On the idea of killing the running sla triggers, if the task succeeds - I
think that's a great idea, it makes sure that the Trigger is removed when
it is no longer needed.

Please let me know if you think idea [3] is worth exploring further and
materializing as a formal proposal. I'd be happy to find time to discuss it
more in depth :)

Sung

On Tue, Sep 12, 2023 at 2:25 PM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Some questions for you Sung.
>
> I tried looking to understand why we needed to remove behavior 3 discussed
> in AIP:
>
> *[remove]* Task-level SLA measured from DAG-run scheduled start time
>
>
> I'm just a little concerned that removing this would be a mistake because,
> in my mind, part of the essence of what an SLA is, is an agreement about,
> in essence, when your data should "be there". (or the "thing" done, etc)
>
> The behavior replacing it seems a lot like the existing execution_timeout
> behavior:
>
> *[add]* Finally, I propose that we implement a new Task-level SLA feature,
> > that is contained within the bounds of the task's lifetime, and is
> measured
> > within the task itself.
>
>
> [1] Do I understand that correctly?  If so, the question I would have is,
> if we can already rely on execution timeout to tell us when an individual
> task is taking too long, why do we need an SLA to tell us the same thing?
>
> [2] Can you also clarify how "dag slas" are defined?  (behavior 1 in the
> AIP).  Is the duration to be applied relative to scheduled time or start
> time?
>
> [3] I saw some discussion of using the triggerer component somehow in
> this.  Is that a thing?  Can you share notes on implementation?  I looked
> on the AIP but did not find anything.
>
> [4] The fact that task sla cannot be used with deferrable operators seems
> problematic to me.  Users tend to expect (rightly or not) something like
> "parity" between deferrable and non-deferrable tasks.  SLAs would seem to
> fall into a category where this seems like a reasonable expectation.
>
> Some discussion on solutions...
>
> I gather that the justification for removing behavior 3 is that it's too
> complex / expensive.  I feel like there's got to be a way.   Thinking about
> how we could implement behavior 3, there's two problems that sort of push
> us into this perceived need to query everything all the time, in a manner
> that is prohibitive.  One is the problem that the dag may not get
> scheduled.  We need a way to raise if the dag is not scheduled X time after
> it's "schedule *at*" date.  For the moment let's call that the dag
> scheduling timeout.  What if, for each dag with such a scheduling timeout,
> we launch a trigger that sleeps until that timeout and then when that runs
> out it runs a single query to determine whether the dag run was scheduled
> -- if not, raise.  Then, for the task part of things.... well if the dag is
> assumed to be reliably scheduled (and this is protected by the first part
> here), then we can, at the time of dag run scheduling, launch triggers for
> the individual tasks.  And they would behave the same way.  Run until
> timeout and then query.
> Perhaps better yet, perhaps when a task completes, it can delete its
> associated "sla" triggers, so that its sla trigger would be cancelled, and
> no query ever has to run.  Same thing could be done for an overall dag SLA
> -- at dag run completion could kill any trigger that is monitoring it's
> execution duration.
>
> Anyway, I understand that the AIP is already accepted, but just hoping we
> can have the discussion, in case we can get to a place where the existing
> "expected" sla behavior can perhaps be preserved in a performant manner.
>
> Thanks
>

-- 
Sung Yun
Cornell Tech '20
Master of Engineering in Computer Science

Re: [DISCUSS] Mechanism of SLA

Reply via email to