Re: [DISCUSS] Mechanism of SLA

Daniel Standish Tue, 12 Sep 2023 14:05:48 -0700

>
> [1] Yes, that's correct. I believe that containing the SLA evaluation
> within the lifetime of a task as a duration-based sla will still have a
> purpose. It's technically implemented like an execution_timeout, but the
> goal of the SLA check is to execute a callback without killing the task: so
> that the accountable owner can be alerted, or a record of the SLA miss can
> be made without killing the task and throwing away the work that's been
> done. Measuring the SLA as a duration of a task is actually a feature that
> has been requested before, by multiple folks in previous discussions and
> PRs.



Good point yeah I see what you mean -- alert but don't kill

[2] In the accepted version of the AIP, the DAG SLA is defined from the
> start_date of the DAG. I had initially considered for it to be evaluated
> from the scheduled start time (data_interval_end), but was stuck on how we
> could avoid false alarms when dag runs are cleared, which was discussed in
> the Google Doc that was circulated
> <
> https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit#heading=h.z45mqf5nt94
> >
> before the AIP was drafted. I've submitted a small change this week to
> introduce
> the dag_run clear_number attribute
> <https://github.com/apache/airflow/pull/34126> which would actually
> mitigate this issue. So if we feel that using the data_interval_end over
> the start_date for a SCHEDULED DagRunType is more appropriate, we can make
> that change on the open DAG SLA PR
> <https://github.com/apache/airflow/pull/33532>.


Using "data interval" doesn't sound right to me.  Data interval could be
completely disconnected from "when should the dag finish".  When you say
start_time you mean "when the dag run starts" not the dag attribute,
correct?  So this would catch a long-running dag, but would not catch when,
for some reason, there's a scheduling delay.  It's starting to feel like
there are multiple kinds of SLAs.  And forgive me if this already
well-trodden ground but I'm just now looking into it closely.  But so it
sounds like what you're adding is perhaps a subtype, DurationSLA, as
distinguished from maybe a different subtype, call it SchedulingSLA.  Maybe
we can offer both?

[3] An example of using Deferrable Operators to create Custom Operators
> that monitor the SLA of a target task is discussed in the "Are there any
> downsides to this change?" section as a workaround that will allow users to
> measure SLA[3]: Task-level SLA measured from DAG-run scheduled start time.
> I agree that this isn't an optimal message to say that we are removing an
> existing feature. But perhaps the fact that users can implement this on
> their own in a scalable manner makes the feature deprecation easier to
> swallow...


It's not just that it's not "optimal messaging".  Why would it not be
optimal?  Because people want the feature, use the feature, expect the
feature. That's the real issue, not the messaging.
If we accept that this is a valid and desirable use cases, I think forcing
users to write extra tasks to do this kind of monitoring, while it's a good
attempt at a mitigation, just is not gonna work.  Not work in the sense of
not good enough / acceptable.  To paint with an overly broad brush, no one,
practically speaking, is going to do that :)

[continuing with 3] I know Pierre Jeambrun brought up the idea of
> implementing this

way of detecting SLAs in a systematic way, so that instead of users
> attaching CustomOperators to their DAGs, Airflow will add Triggers that
> invoke sla callbacks at the end of the Trigger through a configuration
> parameter on the task that we want to check the SLA on. If you think making
> this approach of measuring SLAs on the Trigger, a first class idea on
> Airflow is a good idea - I'm happy to discuss this further and also learn

more from you on how we could make this a reality.

Some running ideas that
> come to my mind - new SLA feature requires a Triggerer process; we create
> SLA watching temporal triggers if sla is defined on the task; the
> sla_miss_callback uses 'context' argument, but the deferred operator that
> is executed upon the completion of the temporal trigger has a strict
> execution_timeout of 10 seconds (or other arbitrary short duration), to
> ensure that the worker slots are not taken up by slas for a significant
> duration; a separate pool for SLA triggers and workers (sla_pool?) so that
> it doesn't use the default pool?


You lost me a bit here with some of those details.  Happy to chat about
it.  But yeah i think having the new SLA behavior rely on triggerer, I
don't see a problem with that.  And I don't see a problem with using
triggerer for more things.  Can't speak for others though.

But separately, my original question here was, how is this AIP going to be
implemented?  Does the code still run on the dag processor?  Thanks

[4] I whole-heartedly agree with you on this - Deferrable Operators are my
> favorite, and it feels bad to me that we'd be introducing a feature that
> can't be extended to deferrables at all. And since the act of missing an
> SLA isn't a final state, I'm not sure exactly how we would be able to
> invoke an SLA callback on a Deferred Operator (or its corresponding
> Trigger) in the current workflow.


Help me understand what is the problem with running SLA callback.  Where /
how do they run now / in the AIP?  And how is this more problematic with
deferrables?

On the idea of dags not being scheduled at a certain time - this is a
> pretty popular edge case that obviously falls under the subject of
> measuring SLAs, but I'm curious if it will be helpful to separate out the
> concern of maintaining a scheduler cluster with enough bandwidth, vs the
> concern of the DAG and the contained tasks running its job within its
> expected SLA, provided that the cluster is in a good state. There are
> existing reliability metrics that help us detect when the DAG runs
> themselves are being scheduled late, or if we are reaching the limit of
> worker parallelism. These are issues that will impact the delivery times of
> all of the DAGs and the Tasks, if not resolved immediately. So to me, as
> long as there are existing ways of detecting these more critical
> infrastructure issues (which there are), I am not too concerned that my SLA
> measuring might be impacted by a late scheduled DAG Run.


Yeah I think this is a legitimate point.... essentially it's like "if your
dags are running behind, you have bigger problems".  And maybe scheduling
delay is not really an issue, and so we shouldn't really worry about it --
I honestly don't know.  I saw you recently added a scheduling delay
metric... maybe you know.  Nonetheless, it does feel like that's sort of
core to what an SLA is.  From business stakeholder perspective, business
user does not care why the task is delayed -- slow scheduling vs slow
running of *this* task vs slow running of *upstream* tasks.  A business
user wants their data in accordance with the SLA.  A business user doesn't
care that X task in the pipeline took X time -- they care that their data
is refreshed when expected.  Maybe the terminology SLA is throwing me off a
bit for this reason, cus it really implies an "agreement" i.e. with someone
else i.e. external stakeholder who... probably doesn't give a hoot about
individual task duration.

It feels like what we're dealing with is various kinds of "expectations".
Not that I'm proposing a rename :) But when my dag finishes, I can set an
"expectation" that the next dag run will be running by X time.  When the
dag starts running, I can set an expectation that the dag will be *done* by
X time (and that the tasks should be done by X times).  We can have
expectations about how long certain things will take to happen.  It seems
trigger would support this very easily.  Whether we should do it or not is
another question 😬

But yeah so, perhaps in a dag with an SLA, it emits events that are
relevant to SLA logic.  And then perhaps something consumes those events
and runs triggers that enforce those expectations.  And when events are
emitted which mean trigger not needed anymore, then that "something"
processes that event and in response kills the relevant trigger.  Thinking
out loud here.  It would put a lot more tasks on the triggerer, but all
they would be doing is sleeping.









On Tue, Sep 12, 2023 at 12:48 PM Sung Yun <sy...@cornell.edu> wrote:

> Hi Daniel,
>
> Thank you for the review! I'm happy to keep having the discussion to make
> sure we can introduce the right way of implementing these solutions into
> Airflow.
>
> My general impression from the community in the discussions so far led me
> to believe that deprecating the problematic feature would be welcomed at
> large, as long as we can educate folks about how they can achieve a similar
> outcome by using Deferrable Operators - which will be much less problematic
> for the Airflow core infrastructure.
> But as you say, if you think it is important to maintain the feature until
> we can find a suitable, systematic replacement for it, I'm happy to hold
> off on the implementations until we can make the necessary edits to the
> AIP.
>
> Here are my answers to your points:
>
> [1] Yes, that's correct. I believe that containing the SLA evaluation
> within the lifetime of a task as a duration-based sla will still have a
> purpose. It's technically implemented like an execution_timeout, but the
> goal of the SLA check is to execute a callback without killing the task: so
> that the accountable owner can be alerted, or a record of the SLA miss can
> be made without killing the task and throwing away the work that's been
> done. Measuring the SLA as a duration of a task is actually a feature that
> has been requested before, by multiple folks in previous discussions and
> PRs.
>
> [2] In the accepted version of the AIP, the DAG SLA is defined from the
> start_date of the DAG. I had initially considered for it to be evaluated
> from the scheduled start time (data_interval_end), but was stuck on how we
> could avoid false alarms when dag runs are cleared, which was discussed in
> the Google Doc that was circulated
> <
> https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit#heading=h.z45mqf5nt94
> >
> before the AIP was drafted. I've submitted a small change this week to
> introduce
> the dag_run clear_number attribute
> <https://github.com/apache/airflow/pull/34126> which would actually
> mitigate this issue. So if we feel that using the data_interval_end over
> the start_date for a SCHEDULED DagRunType is more appropriate, we can make
> that change on the open DAG SLA PR
> <https://github.com/apache/airflow/pull/33532>.
>
> [3] An example of using Deferrable Operators to create Custom Operators
> that monitor the SLA of a target task is discussed in the "Are there any
> downsides to this change?" section as a workaround that will allow users to
> measure SLA[3]: Task-level SLA measured from DAG-run scheduled start time.
> I agree that this isn't an optimal message to say that we are removing an
> existing feature. But perhaps the fact that users can implement this on
> their own in a scalable manner makes the feature deprecation easier to
> swallow... I know Pierre Jeambrun brought up the idea of implementing this
> way of detecting SLAs in a systematic way, so that instead of users
> attaching CustomOperators to their DAGs, Airflow will add Triggers that
> invoke sla callbacks at the end of the Trigger through a configuration
> parameter on the task that we want to check the SLA on. If you think making
> this approach of measuring SLAs on the Trigger, a first class idea on
> Airflow is a good idea - I'm happy to discuss this further and also learn
> more from you on how we could make this a reality. Some running ideas that
> come to my mind - new SLA feature requires a Triggerer process; we create
> SLA watching temporal triggers if sla is defined on the task; the
> sla_miss_callback uses 'context' argument, but the deferred operator that
> is executed upon the completion of the temporal trigger has a strict
> execution_timeout of 10 seconds (or other arbitrary short duration), to
> ensure that the worker slots are not taken up by slas for a significant
> duration; a separate pool for SLA triggers and workers (sla_pool?) so that
> it doesn't use the default pool?
>
> [4] I whole-heartedly agree with you on this - Deferrable Operators are my
> favorite, and it feels bad to me that we'd be introducing a feature that
> can't be extended to deferrables at all. And since the act of missing an
> SLA isn't a final state, I'm not sure exactly how we would be able to
> invoke an SLA callback on a Deferred Operator (or its corresponding
> Trigger) in the current workflow.
>
> On the idea of dags not being scheduled at a certain time - this is a
> pretty popular edge case that obviously falls under the subject of
> measuring SLAs, but I'm curious if it will be helpful to separate out the
> concern of maintaining a scheduler cluster with enough bandwidth, vs the
> concern of the DAG and the contained tasks running its job within its
> expected SLA, provided that the cluster is in a good state. There are
> existing reliability metrics that help us detect when the DAG runs
> themselves are being scheduled late, or if we are reaching the limit of
> worker parallelism. These are issues that will impact the delivery times of
> all of the DAGs and the Tasks, if not resolved immediately. So to me, as
> long as there are existing ways of detecting these more critical
> infrastructure issues (which there are), I am not too concerned that my SLA
> measuring might be impacted by a late scheduled DAG Run.
>
> On the idea of killing the running sla triggers, if the task succeeds - I
> think that's a great idea, it makes sure that the Trigger is removed when
> it is no longer needed.
>
> Please let me know if you think idea [3] is worth exploring further and
> materializing as a formal proposal. I'd be happy to find time to discuss it
> more in depth :)
>
> Sung
>
> On Tue, Sep 12, 2023 at 2:25 PM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > Some questions for you Sung.
> >
> > I tried looking to understand why we needed to remove behavior 3
> discussed
> > in AIP:
> >
> > *[remove]* Task-level SLA measured from DAG-run scheduled start time
> >
> >
> > I'm just a little concerned that removing this would be a mistake
> because,
> > in my mind, part of the essence of what an SLA is, is an agreement about,
> > in essence, when your data should "be there". (or the "thing" done, etc)
> >
> > The behavior replacing it seems a lot like the existing execution_timeout
> > behavior:
> >
> > *[add]* Finally, I propose that we implement a new Task-level SLA
> feature,
> > > that is contained within the bounds of the task's lifetime, and is
> > measured
> > > within the task itself.
> >
> >
> > [1] Do I understand that correctly?  If so, the question I would have is,
> > if we can already rely on execution timeout to tell us when an individual
> > task is taking too long, why do we need an SLA to tell us the same thing?
> >
> > [2] Can you also clarify how "dag slas" are defined?  (behavior 1 in the
> > AIP).  Is the duration to be applied relative to scheduled time or start
> > time?
> >
> > [3] I saw some discussion of using the triggerer component somehow in
> > this.  Is that a thing?  Can you share notes on implementation?  I looked
> > on the AIP but did not find anything.
> >
> > [4] The fact that task sla cannot be used with deferrable operators seems
> > problematic to me.  Users tend to expect (rightly or not) something like
> > "parity" between deferrable and non-deferrable tasks.  SLAs would seem to
> > fall into a category where this seems like a reasonable expectation.
> >
> > Some discussion on solutions...
> >
> > I gather that the justification for removing behavior 3 is that it's too
> > complex / expensive.  I feel like there's got to be a way.   Thinking
> about
> > how we could implement behavior 3, there's two problems that sort of push
> > us into this perceived need to query everything all the time, in a manner
> > that is prohibitive.  One is the problem that the dag may not get
> > scheduled.  We need a way to raise if the dag is not scheduled X time
> after
> > it's "schedule *at*" date.  For the moment let's call that the dag
> > scheduling timeout.  What if, for each dag with such a scheduling
> timeout,
> > we launch a trigger that sleeps until that timeout and then when that
> runs
> > out it runs a single query to determine whether the dag run was scheduled
> > -- if not, raise.  Then, for the task part of things.... well if the dag
> is
> > assumed to be reliably scheduled (and this is protected by the first part
> > here), then we can, at the time of dag run scheduling, launch triggers
> for
> > the individual tasks.  And they would behave the same way.  Run until
> > timeout and then query.
> > Perhaps better yet, perhaps when a task completes, it can delete its
> > associated "sla" triggers, so that its sla trigger would be cancelled,
> and
> > no query ever has to run.  Same thing could be done for an overall dag
> SLA
> > -- at dag run completion could kill any trigger that is monitoring it's
> > execution duration.
> >
> > Anyway, I understand that the AIP is already accepted, but just hoping we
> > can have the discussion, in case we can get to a place where the existing
> > "expected" sla behavior can perhaps be preserved in a performant manner.
> >
> > Thanks
> >
>
>
> --
> Sung Yun
> Cornell Tech '20
> Master of Engineering in Computer Science
>

Re: [DISCUSS] Mechanism of SLA

Reply via email to