I really like the proposal as it is now. I think it is generally ready to
be put up for a vote (and implemented).
I think it has a chance to finally get our SLA feature straightened out.

J.


On Sat, Jul 8, 2023 at 12:00 AM Sung Yun <sy...@cornell.edu> wrote:

> Thank you for the clarification Jarek :)
>
> I’ve updated the AIP on the Confluence page with your suggestion - please
> let me know what you folks think!
>
> In summary, I think it will serve as a great way to maintain some capacity
> to measure a soft timeout within a task. The obvious pros of this approach
> are its reliability and scalability. The downside is that I think making
> it work with Deferrable Operators in the expected way will prove to be
> difficult.
>
>
> https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=247828059#content/view/247828059
>
> Sent from my iPhone
>
> > On Jul 4, 2023, at 3:51 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > 
> >>
> >> Which forking strategy are we exactly proposing?
> >
> > The important part is that you have a separate process that will run a
> > separate Python interpreter, so that if the task runs "C" code that does not
> > check signals, the "timer" thread will be able to stop it regardless (for
> > timeout), and one that can run the "in-parallel" SLA. So likely it is
> >
> > local task
> > | - timeout fork (kills both "children" if fired)
> >    | - sla timer (runs in parallel to task)
> >    | - task code
> >
> >
> > Then when the SLA timer fires, it will just notify - but let the task_code run.
> > When the timeout fires, it will kill both child processes.
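> >
> > Roughly something like this (just an illustrative sketch with multiprocessing -
> > names and timings are made up, not the actual local task runner code):
> >
> > import multiprocessing
> > import time
> >
> > SLA_SECONDS = 60          # soft limit: notify only, keep the task running
> > TIMEOUT_SECONDS = 300     # hard limit: kill both children
> >
> > def sla_timer():
> >     # Runs in parallel to the task; only notifies, never kills.
> >     time.sleep(SLA_SECONDS)
> >     print("SLA exceeded - fire the SLA callback, task keeps running")
> >
> > def task_code():
> >     # Placeholder for the real operator execution.
> >     time.sleep(600)
> >
> > if __name__ == "__main__":
> >     sla = multiprocessing.Process(target=sla_timer)
> >     task = multiprocessing.Process(target=task_code)
> >     sla.start()
> >     task.start()
> >     task.join(timeout=TIMEOUT_SECONDS)
> >     if task.is_alive():
> >         # Hard timeout fired: kill both children.
> >         task.terminate()
> >         sla.terminate()
> >     elif sla.is_alive():
> >         # Task finished before the SLA elapsed: cancel the timer.
> >         sla.terminate()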
> >
> > J.
> >
> >
> >
> >
> >
> >> On Wed, Jun 21, 2023 at 9:22 PM Sung Yun <sy...@cornell.edu> wrote:
> >>
> >> Hi Jarek, I've been mulling over the implementation of (3) task:
> >> time_limit_sla, and I have some follow-up questions about it.
> >>
> >> Which forking strategy are we exactly proposing? Currently, we invoke
> >> task.execute_callable within the taskinstance, which we can effectively
> >> think of as the parent process for the sake of this discussion.
> >>
> >> Are we proposing:
> >> Structure 1
> >> parent: task.execute_callable
> >> └ child 1: sla timer
> >> └ child 2: execution_timeout timer
> >>
> >> Or:
> >> Structure 2
> >> parent: looping process that parses signals from child Processes
> >> └ child 1: sla timer
> >> └ child 2: execution_timeout timer
> >> └ child 3: task.execute_callable
> >>
> >> And also, are we proposing that the callbacks be executed in the child
> >> processes (when the timers complete) or in the parent process?
> >>
> >> Pierre: great questions...
> >>
> >>> How hard would it be to spawn them, when a task runs with SLA configured,
> >>> as a normal workload on the worker?
> >>> Maybe on a dedicated queue / worker?
> >>
> >> My current thought is that having a well-abstracted subclass implementation
> >> of a Deferrable Operator may make the most sense for now. I worry that having
> >> a configuration-driven way of creating SLA monitoring tasks, where they are
> >> created behind the scenes, would create confusion in the user base.
> >> Especially so if there is no dedicated worker pool that completely
> >> isolates the monitoring tasks from the resource pool of normal tasks. So I'm
> >> curious to hear what options we would have in setting up a dedicated worker
> >> pool to complement this idea.
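> >>
> >> For reference, one building block that exists today is Celery queues - below
> >> is a purely illustrative sketch (the check_sla callable is hypothetical), not
> >> a concrete proposal:
> >>
> >> from airflow.operators.python import PythonOperator
> >>
> >> sla_monitor = PythonOperator(
> >>     task_id="sla_monitor",
> >>     python_callable=check_sla,   # hypothetical callable doing the SLA check
> >>     queue="sla_monitoring",      # CeleryExecutor routes this to dedicated workers
> >> )
> >> # workers for that pool would be started with:
> >> #   airflow celery worker --queues sla_monitoring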
> >>
> >> Sung
> >>
> >> On Tue, Jun 20, 2023 at 2:08 PM Pierre Jeambrun <pierrejb...@gmail.com>
> >> wrote:
> >>
> >>> This task_sla is more and more making me think of a ‘task’ on its own. It
> >>> would need to be run in parallel, non-blocking, not overlapping with each
> >>> other, etc…
> >>>
> >>> How hard would it be to spawn them, when a task runs with SLA configured,
> >>> as a normal workload on the worker?
> >>> Maybe on a dedicated queue / worker?
> >>>
> >>>> On Tue 20 Jun 2023 at 16:47, Sung Yun <sy...@cornell.edu> wrote:
> >>>
> >>>> Thank you all for your continued engagement and input! It looks like
> >>>> Iaroslav's layout of 3 different labels of SLAs is helping us group the
> >>>> implementation into different categories, so I will organize my own
> >>>> responses in those logical groupings as well.
> >>>>
> >>>> 1. dag_sla
> >>>> 2. task_sla
> >>>> 3. task: time_limit_sla
> >>>>
> >>>> 1. dag_sla
> >>>> I am going to lean on Jarek's support in driving us to agree on the fact
> >>>> that dag_sla seems like the only one that can stay within the scheduler
> >>>> without incurring an excessive burden on the core infrastructure.
> >>>>
> >>>>> So, I totally agree about dag-level SLAs. It's very important to have it,
> >>>>> and according to Sung Yun's proposal it should be implemented not on the
> >>>>> scheduler job level.
> >>>>
> >>>> In response to this, I want to clarify that I am specifically
> >>> highlighting
> >>>> that dag_sla is the only one that can be supported by the scheduler
> >> job.
> >>>> Dag_sla isn't a feature that exists right now, and my submission
> >> proposes
> >>>> exactly this!
> >>>>
> >>>> 2. task_sla
> >>>> I think Utkarsh's response really helped highlight another compounding
> >>>> issue with SLAs in Airflow, which is that users have varying definitions
> >>>> of SLAs, and of what they want to do when an SLA is breached.
> >>>> On a high level, task_sla relies on a relationship between the dag_run and
> >>>> a specific task within that specific dag_run: it is the time between a
> >>>> dag_run's scheduled start time and the actual start or end time of an
> >>>> individual task within that run.
> >>>> Hence, it is impossible for it to be computed in a distributed way that
> >>>> addresses all of the issues highlighted in the AIP; it needs to be managed
> >>>> by a central process that has access to the single source of truth.
> >>>> As Utkarsh suggests, I think this is perhaps doable as a separate process,
> >>>> and it would probably be much safer to do it that way.
> >>>> My only concern is that we would be introducing a separate Airflow process
> >>>> that is strictly optional, but one that requires quite a large amount of
> >>>> investment in designing the right abstractions to meet user satisfaction
> >>>> and reliability guarantees.
> >>>> It will also require us to review the indexing model of the database's
> >>>> dag/dag_run/task tables to make sure that continuous queries to the
> >>>> database will not overload it.
> >>>> This isn't simple, because we will have to select tasks in any state
> >>>> (FAILED, SUCCESS or RUNNING) that have not yet had their SLA evaluated,
> >>>> from any dagRun (FAILED, SUCCESS or RUNNING), in order to make sure we
> >>>> don't miss any tasks - because in this paradigm, the concept of SLA
> >>>> triggering is decoupled from a dagrun or task execution.
> >>>> A query that selects tasks in ANY state from dag_runs in ANY state is bound
> >>>> to be incredibly expensive - and I discuss this challenge in the Confluence
> >>>> AIP and the Google Doc.
> >>>> This will possibly be even more difficult to achieve, because we have to
> >>>> support multiple such processes now that Airflow supports High
> >>>> Availability.
> >>>> So although setting up a separate process decouples the SLA evaluation from
> >>>> the scheduler, we need to acknowledge that we may be introducing a heavy
> >>>> dependency on the metadata database.
> >>>>
> >>>> My suggestion to leverage the existing Triggerer process to design
> >>>> monitoring Deferrable Operators to execute SLA callbacks has the benefit of
> >>>> reducing the load on the database while achieving similar goals, because it
> >>>> registers the SLA monitoring operator as a TASK on the dag_run that it is
> >>>> associated with, and prevents the dag_run from completing if the SLA has
> >>>> not yet been computed. This means that our query will be strictly limited
> >>>> to just the dagRuns in RUNNING state - this is a HUGE difference from
> >>>> having to query dagruns in all states in a separate process, because we are
> >>>> merely attaching a few additional tasks to existing dag_runs.
> >>>>
> >>>> In summary: I'm open to this idea, I just have not been able to think of a
> >>>> way to manage it without overloading the scheduler or the database.
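> >>>>
> >>>> To make the Deferrable Operator idea above a bit more concrete, here is a
> >>>> very rough sketch (the operator name, the on_sla_miss hook and the SLA
> >>>> semantics are illustrative only, not part of the proposal):
> >>>>
> >>>> from datetime import timedelta
> >>>>
> >>>> from airflow.models.baseoperator import BaseOperator
> >>>> from airflow.triggers.temporal import DateTimeTrigger
> >>>>
> >>>> class SlaMonitorOperator(BaseOperator):
> >>>>     # Hypothetical monitoring task added to a DAG next to the tasks it watches.
> >>>>
> >>>>     def __init__(self, sla: timedelta, on_sla_miss=None, **kwargs):
> >>>>         super().__init__(**kwargs)
> >>>>         self.sla = sla
> >>>>         self.on_sla_miss = on_sla_miss
> >>>>
> >>>>     def execute(self, context):
> >>>>         # Defer immediately: the Triggerer wakes us up at the deadline,
> >>>>         # so no worker slot is held while we wait.
> >>>>         deadline = context["dag_run"].logical_date + self.sla
> >>>>         self.defer(trigger=DateTimeTrigger(moment=deadline),
> >>>>                    method_name="evaluate_sla")
> >>>>
> >>>>     def evaluate_sla(self, context, event=None):
> >>>>         # Back on a worker after the deadline: inspect the watched tasks in
> >>>>         # this (still running) dag_run and fire the callback on a miss.
> >>>>         if self.on_sla_miss:
> >>>>             self.on_sla_miss(context)
> >>>>
> >>>> Because the monitor is itself a task in the dag_run, the dag_run stays in the
> >>>> RUNNING state until the deadline, which is what keeps the query scope small.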
> >>>>
> >>>> 3. task: time_limit_sla
> >>>> Jarek: That sounds like a great idea that we could group into this AIP - I
> >>>> will make some time to add some code snippets to the AIP to make this idea
> >>>> a bit clearer to everyone reading it in preparation for the vote.
> >>>>
> >>>>
> >>>> Sung
> >>>>
> >>>> On Sun, Jun 18, 2023 at 9:38 PM utkarsh sharma <
> utkarshar...@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>>>>
> >>>>>> This can be IMHO implemented on the task level. We currently have timeout
> >>>>>> implemented this way - whenever we start the task, we can have a signal
> >>>>>> handler registered with "real" time that will cancel the task.
> >>>>>> But I can imagine a similar approach with a signal that propagates the
> >>>>>> information that the task exceeded the time it has been allocated but
> >>>>>> would not stop it - just propagate the information (in the form of the way
> >>>>>> we currently do callbacks, for example, or maybe (even better) only run it
> >>>>>> in the context of the task) to signal a "soft timeout" per task:
> >>>>>>
> >>>>>>>            signal.signal(signal.SIGALRM, self.handle_timeout)
> >>>>>>>            signal.setitimer(signal.ITIMER_REAL, self.seconds)
> >>>>>>
> >>>>>> This has the advantage that it is fully distributed - i.e. we do not need
> >>>>>> anything monitoring 1000s of running tasks to decide if an SLA has been
> >>>>>> breached. It's the task itself that will get the "soft" timeout and
> >>>>>> propagate it, and then whoever receives the callback can decide what to do
> >>>>>> next - this "callback" can happen either in the task context or it could
> >>>>>> be done in a DagFileProcessor context as we do currently - though the
> >>>>>> in-task processing seems much more distributed and scalable in nature.
> >>>>>> There is one watch-out here: this is not **guaranteed** to work. There
> >>>>>> are cases, that we already saw, where the SIGALRM is not going to be
> >>>>>> handled locally - when the task uses a long-running C-level function that
> >>>>>> is not written to react to signals generated in Python (think a
> >>>>>> low-level long-running Pandas C-method call that does not check signals
> >>>>>> in a long-running loop). That however could probably be handled by one
> >>>>>> more process fork - a dedicated child process that would monitor running
> >>>>>> tasks from a separate process - and we could actually improve both timeout
> >>>>>> and SLA handling by introducing such an extra forked process to handle
> >>>>>> timeout/task-level time_limit_sla, so IMHO this is an opportunity to
> >>>>>> improve things.
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Building on what Jarek mentioned, if we can enable the scheduler to emit
> >>>>> events for DAGs with SLA configured in cases of:
> >>>>> 1. DAG starts executing
> >>>>> 2. Task starts executing (for every task)
> >>>>> 3. Task stops executing (for every task)
> >>>>> 4. DAG stops executing
> >>>>>
> >>>>> and have a separate process (per dag run) that can keep monitoring such
> >>>>> events and execute a callback in the following circumstances:
> >>>>> 1. DAG level SLA miss
> >>>>>    - When the entire DAG didn't finish in a specific time
> >>>>> 2. Task-level SLA miss
> >>>>>    - Counting time from the start of the DAG to the end of a task.
> >>>>>    - Start of a task to end of a task.
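> >>>>>
> >>>>> A very rough sketch of what such a per-dag-run monitoring loop could look
> >>>>> like (the event source and the on_miss callback are hypothetical, just to
> >>>>> illustrate the idea):
> >>>>>
> >>>>> import time
> >>>>> from datetime import datetime, timedelta, timezone
> >>>>>
> >>>>> def monitor_dag_run(events, dag_sla: timedelta, on_miss, poll_interval=1.0):
> >>>>>     # events: hypothetical source of (kind, timestamp) records emitted by
> >>>>>     # the scheduler; on_miss: user callback taking a description string.
> >>>>>     dag_start = None
> >>>>>     dag_notified = False
> >>>>>     while True:
> >>>>>         for kind, ts in events.fetch_new():     # hypothetical API
> >>>>>             if kind == "dag_started":
> >>>>>                 dag_start = ts
> >>>>>             elif kind == "dag_finished":
> >>>>>                 return
> >>>>>         if (dag_start is not None and not dag_notified
> >>>>>                 and datetime.now(timezone.utc) - dag_start > dag_sla):
> >>>>>             on_miss("DAG-level SLA missed")     # notify only, keep monitoring
> >>>>>             dag_notified = True
> >>>>>         time.sleep(poll_interval)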
> >>>>>
> >>>>> I think the above approach should address the issues listed in AIP-57 <
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-57+SLA+as+a+DAG-Level+Feature
> >>>>> >
> >>>>> 1. Since we have a separate process, we no longer have to wait for the
> >>>>> Tasks/DAG to be in the SUCCESS/SKIPPED state or any other terminal state.
> >>>>> In this process, we can have a loop executing periodically at intervals of
> >>>>> e.g. 1 sec to monitor SLA misses by monitoring the events data and the
> >>>>> task_instance table.
> >>>>> 2. For manually/Dataset-triggered dags, we no longer have a dependency on a
> >>>>> fixed schedule; everything we need to evaluate an SLA miss is already
> >>>>> present in the events for that specific DAG.
> >>>>> 3. This approach also enables us to run callbacks on the task level.
> >>>>> 4. We can remove the calls to sla_miss_callbacks every time
> >>>>> *dag_run.update_state* is called.
> >>>>>
> >>>>> A couple of things I'm not sure about are:
> >>>>> 1. Where to execute the callbacks. Executing a callback in the same process
> >>>>> as the monitoring process has a downside: if the callback takes a long time
> >>>>> to execute, it will probably cause other SLA callbacks to be delayed.
> >>>>> 2. The execution context of the callback - we have to maintain the same
> >>>>> context in which the callback is defined.
> >>>>>
> >>>>> Would love to know other people's thoughts on this :)
> >>>>>
> >>>>> Thanks,
> >>>>> Utkarsh Sharma
> >>>>>
> >>>>>
> >>>>> On Sun, Jun 18, 2023 at 4:08 PM Iaroslav Poskriakov <
> >>>>> yaroslavposkrya...@gmail.com> wrote:
> >>>>>
> >>>>>> I want to say that Airflow is a very popular project and the ways of
> >>>>>> calculating SLA differ, because of different business cases. And if
> >>>>>> it's possible, we should support most of them out of the box.
> >>>>>>
> >>>>>> On Sun, Jun 18, 2023 at 1:30 PM Iaroslav Poskriakov <
> >>>>>> yaroslavposkrya...@gmail.com>:
> >>>>>>
> >>>>>>> So, I totally agree about dag-level SLAs. It's very important to have it,
> >>>>>>> and according to Sung Yun's proposal it should be implemented not on the
> >>>>>>> scheduler job level.
> >>>>>>>
> >>>>>>> Regarding the second way of determining SLA: <task state STARTED> -->
> >>>>>>> ..<doesn't matter what happened>.. --> <task state SUCCESS>.
> >>>>>>> It's very helpful when we want to achieve not a technical SLA but a
> >>>>>>> business SLA for the team which is using that DAG. Because between those
> >>>>>>> two states anything could happen, and in the end we might want to
> >>>>>>> understand the high-level SLA for the task - because it doesn't matter
> >>>>>>> for the business, I guess, that the path of states of the task was
> >>>>>>> something like: STARTED -> RUNNING -> FAILED -> RUNNING -> FAILED ->
> >>>>>>> RUNNING -> SUCCESS. And in case something similar is happening, it can be
> >>>>>>> helpful to have the ability to automatically recognize that the task
> >>>>>>> exceeded the expected time.
> >>>>>>>
> >>>>>>> I agree that for the scheduler it can be too heavy. Also, for that
> >>>>>>> purpose we need some process which runs in parallel with the task. It can
> >>>>>>> be one more job, for example, running on the same machine as the
> >>>>>>> Scheduler, or not on the same one.
> >>>>>>>
> >>>>>>>
> >>>>>>> About the third part of my proposal - time for the task in the RUNNING
> >>>>>>> state - I agree with you, Jarek. We can implement it on the task level.
> >>>>>>> That seems good to me.
> >>>>>>>
> >>>>>>> Yaro1
> >>>>>>>
> >>>>>>> On Sun, Jun 18, 2023 at 8:12 AM Jarek Potiuk <ja...@potiuk.com>:
> >>>>>>>
> >>>>>>>> I am also for DAG level SLA only (but maybe there are some
> >>> twists).
> >>>>>>>>
> >>>>>>>> And I hope (since Sung Yun has not given up on that) - maybe
> >> that
> >>> is
> >>>>> the
> >>>>>>>> right time that others here will chime in and maybe it will let
> >>> the
> >>>>> vote
> >>>>>>>> go
> >>>>>>>> on? I think it would be great to get the SLA feature sorted out
> >> so
> >>>>> that
> >>>>>> we
> >>>>>>>> have a chance to stop answering ("yeah, we know SLA is broken,
> >> it
> >>>> has
> >>>>>>>> always been"). It would be nice to say "yeah the old deprecated
> >>> SLA
> >>>> is
> >>>>>>>> broken, but we have this new mechanism(s) that replace it". The
> >>> one
> >>>>>>>> proposed by Sung has a good chance of being such a replacement.
> >>>>>>>>
> >>>>>>>> I think having a task-level SLA managed by the Airflow framework might
> >>>>>>>> indeed be too costly and does not fit well in the current architecture. I
> >>>>>>>> think attempting to monitor how long a given task runs by the scheduler
> >>>>>>>> is simply a huge overkill. Generally speaking - the scheduler (as
> >>>>>>>> surprising as it might be for anyone) does not monitor executing tasks
> >>>>>>>> (at least principally speaking). It merely submits the tasks to execute
> >>>>>>>> to the executor and lets the executor handle all kinds of monitoring of
> >>>>>>>> what is being executed when, and then - depending on the different types
> >>>>>>>> of executors - there are various conditions for when and how a task is
> >>>>>>>> executed, and various ways you can define different kinds of task SLAs.
> >>>>>>>> Or at least this is how I think about the distributed nature of Airflow
> >>>>>>>> on a "logical" level. Once a task is queued for execution, the scheduler
> >>>>>>>> takes its hands off and turns its attention to tasks that are not yet
> >>>>>>>> scheduled and should be, or tasks that are scheduled but not queued yet.
> >>>>>>>>
> >>>>>>>> But maybe some of the SLA "task" expectations can be
> >> implemented
> >>>> in a
> >>>>>>>> limited version serving very limited cases on a task level?
> >>>>>>>>
> >>>>>>>> Referring to what Yaro1 wrote:
> >>>>>>>>
> >>>>>>>>> 1. It doesn't matter for us how long we are spending time on
> >>> some
> >>>>>>>> specific
> >>>>>>>> task. It's important to have an understanding of the lag between
> >>>>>>>> execution_date of dag and success state for the task. We can
> >> call
> >>> it
> >>>>>>>> dag_sla. It's similar to the current implementation of
> >>> manage_slas.
> >>>>>>>>
> >>>>>>>> This is basically what Sung proposes, I believe.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> 2. It's important to have an understanding and managing how
> >> long
> >>>>> some
> >>>>>>>> specific task is working. In my opinion working is the state
> >>> between
> >>>>>> task
> >>>>>>>> last start_date and task first (after last start_date) SUCCESS
> >>>> state.
> >>>>> So
> >>>>>>>> for example for the task which is placed in FAILED state we
> >> still
> >>>> have
> >>>>>> to
> >>>>>>>> check an SLA in that strategy. We can call it task_sla.
> >>>>>>>>
> >>>>>>>> I am not sure if I understand it, but if I do, then this is the "super
> >>>>>>>> costly" SLA processing that we should likely avoid. I would love to hear,
> >>>>>>>> however, what some specific use cases are that we could show here; maybe
> >>>>>>>> there are other ways we can achieve similar things.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> 3. Sometimes we need to manage time for the task in the
> >> RUNNING
> >>>>> state.
> >>>>>>>> We
> >>>>>>>> can call it time_limit_sla.
> >>>>>>>>
> >>>>>>>> This can be IMHO implemented on the task level. We currently have
> >>>>>>>> timeout implemented this way - whenever we start the task, we can have a
> >>>>>>>> signal handler registered with "real" time that will cancel the task.
> >>>>>>>> But I can imagine a similar approach with a signal that propagates the
> >>>>>>>> information that the task exceeded the time it has been allocated but
> >>>>>>>> would not stop it - just propagate the information (in the form of the
> >>>>>>>> way we currently do callbacks, for example, or maybe (even better) only
> >>>>>>>> run it in the context of the task) to signal a "soft timeout" per task:
> >>>>>>>>
> >>>>>>>>>            signal.signal(signal.SIGALRM, self.handle_timeout)
> >>>>>>>>>            signal.setitimer(signal.ITIMER_REAL, self.seconds)
> >>>>>>>>
> >>>>>>>> This has the advantage that it is fully distributed - i.e. we do not
> >>>>>>>> need anything monitoring 1000s of running tasks to decide if an SLA has
> >>>>>>>> been breached. It's the task itself that will get the "soft" timeout and
> >>>>>>>> propagate it, and then whoever receives the callback can decide what to
> >>>>>>>> do next - this "callback" can happen either in the task context or it
> >>>>>>>> could be done in a DagFileProcessor context as we do currently - though
> >>>>>>>> the in-task processing seems much more distributed and scalable in
> >>>>>>>> nature.
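> >>>>>>>>
> >>>>>>>> To make the "notify but don't stop" idea concrete, a minimal sketch of
> >>>>>>>> such a soft-timeout handler could look like this (illustrative only -
> >>>>>>>> class and callback names are made up, this is not the existing timeout
> >>>>>>>> code):
> >>>>>>>>
> >>>>>>>> import signal
> >>>>>>>>
> >>>>>>>> class SoftTimeout:
> >>>>>>>>     # Hypothetical helper: records and propagates an SLA breach, never kills.
> >>>>>>>>
> >>>>>>>>     def __init__(self, seconds, on_breach):
> >>>>>>>>         self.seconds = seconds
> >>>>>>>>         self.on_breach = on_breach
> >>>>>>>>         self.breached = False
> >>>>>>>>
> >>>>>>>>     def handle_soft_timeout(self, signum, frame):
> >>>>>>>>         # Unlike execution_timeout, do not raise - just record the breach,
> >>>>>>>>         # notify whoever is interested, and let the task keep running.
> >>>>>>>>         self.breached = True
> >>>>>>>>         self.on_breach()
> >>>>>>>>
> >>>>>>>>     def start(self):
> >>>>>>>>         signal.signal(signal.SIGALRM, self.handle_soft_timeout)
> >>>>>>>>         signal.setitimer(signal.ITIMER_REAL, self.seconds)
> >>>>>>>>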
> >>>>>>>> There is one watch-out here: this is not **guaranteed** to work. There
> >>>>>>>> are cases, that we already saw, where the SIGALRM is not going to be
> >>>>>>>> handled locally - when the task uses a long-running C-level function
> >>>>>>>> that is not written to react to signals generated in Python (think a
> >>>>>>>> low-level long-running Pandas C-method call that does not check signals
> >>>>>>>> in a long-running loop). That however could probably be handled by one
> >>>>>>>> more process fork - a dedicated child process that would monitor running
> >>>>>>>> tasks from a separate process - and we could actually improve both
> >>>>>>>> timeout and SLA handling by introducing such an extra forked process to
> >>>>>>>> handle timeout/task-level time_limit_sla, so IMHO this is an opportunity
> >>>>>>>> to improve things.
> >>>>>>>>
> >>>>>>>> I would love to hear what others think about it :) I think our SLA for
> >>>>>>>> fixing SLA is about to run out.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> J.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jun 15, 2023 at 4:05 PM Sung Yun <sy...@cornell.edu>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello!
> >>>>>>>>>
> >>>>>>>>> Thank you very much for the feedback on the proposal. I’ve been hoping
> >>>>>>>>> to get some more traction on it, so it’s great to hear from another
> >>>>>>>>> user of the feature.
> >>>>>>>>>
> >>>>>>>>> I understand that there’s a lot of support for keeping a native
> >>>>>>>>> task-level SLA feature, and I definitely agree with that sentiment. Our
> >>>>>>>>> organization very much relies on Airflow to evaluate ‘task_sla’ in order
> >>>>>>>>> to keep track of which tasks in each dag failed to succeed by an
> >>>>>>>>> expected time.
> >>>>>>>>>
> >>>>>>>>> In the AIP I put together on the Confluence page, and in the
> >>>> Google
> >>>>>>>> docs,
> >>>>>>>>> I have identified why the existing implementation of the task
> >>>> level
> >>>>>> SLA
> >>>>>>>>> feature can be problematic and is often misleading for Airflow
> >>>>> users.
> >>>>>>>> The
> >>>>>>>>> feature is also quite costly for Airflow scheduler and
> >>>>> dag_processor.
> >>>>>>>>>
> >>>>>>>>> In that sense, the discussion is not about whether or not these SLA
> >>>>>>>>> features are important to the users, but is much more technical: can a
> >>>>>>>>> task-level feature be supported in a first-class way as a core feature
> >>>>>>>>> of Airflow, or should it be implemented by the users, for example as
> >>>>>>>>> independent tasks leveraging Deferrable Operators?
> >>>>>>>>>
> >>>>>>>>> My current thought is that only Dag level SLAs can be
> >> supported
> >>>> in a
> >>>>>>>>> non-disruptive way by the scheduler, and that task level SLAs
> >>>> should
> >>>>>> be
> >>>>>>>>> handled outside of core Airflow infrastructure code. If you
> >>>> strongly
> >>>>>>>>> believe otherwise, I think it would be helpful if you could
> >>>> propose
> >>>>> an
> >>>>>>>>> alternative technical solution that solves many of the
> >> existing
> >>>>>>>> problems in
> >>>>>>>>> the task-level SLA feature.
> >>>>>>>>>
> >>>>>>>>> Sent from my iPhone
> >>>>>>>>>
> >>>>>>>>>> On Jun 13, 2023, at 1:10 PM, Ярослав Поскряков <
> >>>>>>>>> yaroslavposkrya...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Mechanism of SLA
> >>>>>>>>>>
> >>>>>>>>>> Hi, I read the previous conversation regarding SLA and I think
> >>>>>>>>>> removing the opportunity to set an SLA at the task level would be a big
> >>>>>>>>>> mistake.
> >>>>>>>>>> So, the proposed implementation of the task-level SLA will not work
> >>>>>>>>>> correctly.
> >>>>>>>>>>
> >>>>>>>>>> That's why I guess we have to think about the mechanism of
> >>> using
> >>>>>> SLA.
> >>>>>>>>>>
> >>>>>>>>>> I guess we should check three different cases in general.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. It doesn't matter for us how long we spend on some specific task.
> >>>>>>>>>> It's important to have an understanding of the lag between the
> >>>>>>>>>> execution_date of the dag and the SUCCESS state of the task. We can
> >>>>>>>>>> call it dag_sla. It's similar to the current implementation of
> >>>>>>>>>> manage_slas.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2. It's important to understand and manage how long some specific task
> >>>>>>>>>> is working. In my opinion, "working" is the time between the task's
> >>>>>>>>>> last start_date and its first SUCCESS state after that last start_date.
> >>>>>>>>>> So, for example, for a task which is placed in the FAILED state we
> >>>>>>>>>> still have to check the SLA in that strategy. We can call it task_sla.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 3. Sometimes we need to manage the time the task spends in the RUNNING
> >>>>>>>>>> state. We can call it time_limit_sla.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Those three types of SLA will cover all possible cases.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> So we will have three different strategies for SLA.
> >>>>>>>>>>
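> >>>>>>>>>> For example, with illustrative timestamps - execution_date 00:00, first
> >>>>>>>>>> task start 00:10, a retry starting at 00:20, and the first SUCCESS at
> >>>>>>>>>> 00:45:
> >>>>>>>>>>
> >>>>>>>>>>   dag_sla        -> 00:45 - 00:00 = 45 min after execution_date
> >>>>>>>>>>   task_sla       -> 00:45 - 00:20 = 25 min after the last start_date
> >>>>>>>>>>   time_limit_sla -> only the time actually spent in the RUNNING state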
> >>>>>>>>>>
> >>>>>>>>>> I guess we can use that idea for dag_sla -
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> For task_sla and time_limit_sla I prefer to stick with using the
> >>>>>>>>>> SchedulerJob.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Github: Yaro1
> >>>>>>>>>
> >>>>>>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> >>>>>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sung Yun
> >>>> Cornell Tech '20
> >>>> Master of Engineering in Computer Science
> >>>>
> >>>
> >>
> >>
> >> --
> >> Sung Yun
> >> Cornell Tech '20
> >> Master of Engineering in Computer Science
> >>
>
