Mechanism of SLA

Hi, I read the previous conversation about SLA, and I think removing the
ability to set an SLA at the task level would be a big mistake.
At the same time, the proposed implementation of the task-level SLA will
not work correctly.

That's why I think we need to reconsider how the SLA mechanism works.

In general, I think we should consider three different cases.


1. We don't care how long a specific task takes to run. What matters is
the lag between the DAG's execution_date and the task reaching the success
state. We can call it dag_sla. It's similar to the current implementation
of manage_slas.


2. We need to track and limit how long a specific task runs. In my
opinion, the running time is the interval between the task's last
start_date and its first SUCCESS state after that start_date. So, for
example, a task that ends up in the FAILED state still has to be checked
against the SLA under this strategy. We can call it task_sla.


3. Sometimes we need to limit how long a task stays in the RUNNING state.
We can call it time_limit_sla.


Those three types of SLA will cover all possible cases.


So we will have three different strategies for SLA.
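
To make the difference between the three strategies concrete, here is a
minimal sketch in plain Python. It is not Airflow code: the function names,
parameters, and the "running" state string are hypothetical and only
illustrate which interval each strategy would measure, as I understand the
three cases above.

```python
from datetime import datetime, timedelta
from typing import Optional


def dag_sla_missed(execution_date: datetime,
                   success_date: Optional[datetime],
                   sla: timedelta,
                   now: datetime) -> bool:
    """dag_sla: lag between the DAG's execution_date and the task's success.

    If the task has not succeeded yet, the lag is measured up to `now`.
    """
    end = success_date if success_date is not None else now
    return end - execution_date > sla


def task_sla_missed(last_start_date: datetime,
                    success_date: Optional[datetime],
                    sla: timedelta,
                    now: datetime) -> bool:
    """task_sla: time from the last start_date to the first SUCCESS after it.

    A task without a SUCCESS after its last start (for example a FAILED one)
    is still checked, with the interval measured up to `now`.
    """
    if success_date is not None and success_date >= last_start_date:
        end = success_date
    else:
        end = now
    return end - last_start_date > sla


def time_limit_sla_missed(state: str,
                          start_date: datetime,
                          limit: timedelta,
                          now: datetime) -> bool:
    """time_limit_sla: only applies while the task is in the running state."""
    if state != "running":  # assumed state name, for illustration only
        return False
    return now - start_date > limit
```

Note that only dag_sla depends on execution_date. The other two depend only
on the task's own start_date, which is why they could be checked separately
from dag_sla.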


For dag_sla, I think we can use this idea:
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals


For task_sla and time_limit_sla, I would prefer to keep using SchedulerJob.


GitHub: Yaro1
