syun64 commented on PR #8545: URL: https://github.com/apache/airflow/pull/8545#issuecomment-1491091885
Hi everyone, I've spent a lot of time collecting the concerns the community has reported about SLAs to date. After much deliberation, I've concluded that we might be better off defining the Airflow-native SLA feature only at the DAG level, where it can be supported to users' expectations in a first-class way, and leaving the task-level SLA definition to the users.

There are three main reasons why I think task-level SLAs should be implemented by users instead of by Airflow:

1. Today, users can already monitor task-level SLAs through Deferrable Operators and asynchronous DateTimeTriggers (with Task Groups to organize these tasks in the UI).
2. Reliably detecting a task-level SLA miss at the moment the task actually misses it (instead of only after the task succeeds) is only possible by loading the scheduler with task-level SLA detection. That is not ideal: task-level SLA detection is not the primary function of a scheduler, and compromising the scheduler in any way would not benefit Airflow users.
3. Some users want to customize how they monitor task-level SLAs. Some want different definitions of the timedelta (from DAG run start versus from task start), some want to detect task SLA misses multiple times (different levels of warning for delays), and some want to detect the SLA miss only if the target task is in a certain state (an unfinished state such as RUNNING, or a finished state such as SUCCESS/SKIPPED).

In contrast, I believe a DAG-level SLA will be strictly a positive feature. It will increase the general reliability of Airflow DAGs and can even alert us to job delays when [undefined behaviors happen](https://github.com/apache/airflow/issues/21225), all without negatively impacting the performance of the scheduler.
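To make the variations in point 3 a bit more concrete, here is a rough, standard-library-only sketch of how differently users might define a "task-level SLA miss". Nothing here is Airflow API: `Anchor`, `sla_missed`, `breached_levels`, and the lowercase state names are all hypothetical names chosen for illustration, and a real setup would run a check like this from user code (for example inside a deferrable sensor) rather than in the scheduler.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, FrozenSet, List, Optional


class Anchor(Enum):
    """Which timestamp the SLA timedelta is measured from (hypothetical)."""
    DAGRUN_START = "dagrun_start"
    TASK_START = "task_start"


# Hypothetical state groupings, mirroring the examples in point 3.
UNFINISHED: FrozenSet[str] = frozenset({"running"})
FINISHED: FrozenSet[str] = frozenset({"success", "skipped"})


def sla_missed(
    now: datetime,
    dagrun_start: datetime,
    task_start: Optional[datetime],
    task_state: str,
    sla: timedelta,
    anchor: Anchor,
    only_states: Optional[FrozenSet[str]] = None,
) -> bool:
    """Return True if the SLA is missed under the chosen definition."""
    # Some users only care about a miss while the task is in certain states.
    if only_states is not None and task_state not in only_states:
        return False
    reference = dagrun_start if anchor is Anchor.DAGRUN_START else task_start
    # A task that has not started cannot miss a task-start-relative SLA.
    if reference is None:
        return False
    return now > reference + sla


def breached_levels(
    now: datetime, reference: datetime, thresholds: Dict[str, timedelta]
) -> List[str]:
    """Return the names of all warning levels already exceeded, mildest first."""
    ordered = sorted(thresholds.items(), key=lambda item: item[1])
    return [name for name, delta in ordered if now > reference + delta]
```

The same task at the same moment can be "late" under one definition and "on time" under another, which is why a single built-in task-level definition is hard to get right for everyone.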
If you have been interested in the SLA mechanism, or have been actively using the current version of the SLA mechanism, I would love to get your feedback on this proposal. I would love to work with you to try to come up with an SLA solution that meets user expectations!

[Airflow Improvement Proposal](https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-57+SLA+as+a+DAG-Level+Feature)
