Thanks Sung Yun, Fantastic document!
This is one incredibly well thought out document and proposal. I like the detailed analysis of the problem (finally it answers very clearly and explicitly why the current SLA mechanism is essentially broken). It also very well reflects the changes in our approach with dataset-driven scheduling where we promote using "micro-pipeline" approach, where you split more monolithic DAGs to a number of smaller dags linked by dataset dependencies. Before that, DAG-level SLAs would not be as complete. Simply without the "micro-pipeline" approach, DAG-level SLAs would be far too coarse-grained and task-level SLAs were too fine-grained. But when you pair dataset-driven scheduling and DAG level SLA, you generally leave the decision to the user what granularity they would like to have for those. And this is great. I think the doc is very close to be AIP-ready, I think what I miss are few things only: * an example and proposal on how the definition of the "new" SLA would be different from the current one and I think this proposal should already include immediate deprecation of the old SLA mechanism (for the reasons described in the docs). * also you mention that the current SLA approach can be replaced with time Triggers. Few examples of how you could do that, showing a path for the users on how they could migrate their current sla approach would be incredibly useful and it would show that the deprecation path is viable (and easy) even if someone would like to keep task level SLA implemented this way and justify the immediate deprecation. J. On Fri, Mar 17, 2023 at 5:22 AM Sung Yun <[email protected]> wrote: > > Dear Airflow community, > > I've been looking to make a new SLA proposal by crowdsourcing all the known > issues and proposals that have been raised over the last couple of years. > In summary, I believe that defining SLAs at the task level puts too much > work on the scheduler, if we were to solve all of the known issues. Given > this clear downside, task-level callbacks may not be strictly necessary, > especially with tools like DateTimeTriggers that can substitute the > function of task-level SLA callbacks. > > On the other hand, I believe that SLAs defined at the DAG level will be > extremely useful as a 'catch-all' alert in case anything goes wrong in a > DAG (e.g. significant delays in task execution, tasks stuck in queued > state). In addition, SLAs defined at the DAG level will be incredibly > lightweight to detect and execute callbacks for. Hence, I'd like to propose > that SLAs be defined as a DAG-level attribute. If you are interested in > this feature, please take a look at my detailed proposal and example > implementations in the below Google Doc. Please feel free to reply to this > message, or simply leave comments in the Doc. > > https://docs.google.com/document/d/1drNaYmAy6GqC4WGGn4MNt6VqbOwVNm7jPfmr5Pc52AU/edit?usp=sharing > > Thank you! > > Github: syun64 <https://github.com/syun64> > -- > Sung Yun > Cornell '20 > Master of Engineering in Computer Science --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
