I think generally that SLA has been broken for quite some time and while we are continuously saying "we need to fix it" there is no effort to do so. Not sure if this is because it is unimportant. But also maybe the feature actually is useful in a number of cases and not broken too much. And maybe we can fix it in some incremental steps rather than totally rewriting it.
In this context - this change is supposed to fix one of the problems with SLA - if there are too many of those (which can be generated if you have some failures) it actually might get to the point that it will block dag file processor from doing the "Real" stuff. Adding priority queue is a good idea I think. And it's is a good first step. There could be other ideas in place (de-duplicating tha callbacks is I think already implemented), maybe simply dropping some of the callbacks eventually. But I think this looks promising to start with that. WDYT everyone? Is there anyone else planning to do any serious stuff on SLA callbacks? Should we take a close look at that one. Mateusz Henc, I think this might also have some - positive impact on finalizing the AIP-43? J. On Thu, Aug 4, 2022 at 10:09 PM Andrew Gibbs <[email protected]> wrote: > Hi everyone, > > First time post to the dev list ... please be gentle! > > I raised a PR to fix SLA alerts ( > https://github.com/apache/airflow/pull/25489) but it's not trivial. Jared > Potiuk asked that I flag it up here where it might get more attention, and > I'm happy to oblige. > > > *A brief summary of the problem* > To the end-user, adding SLAs means that your system stops processing > changes to the dag files. > > *A brief summary of the cause* > Adding a SLA to a task in a dag means SLA callbacks get raised and passed > back to the DagProcessorManager. The callbacks can get created in > relatively large volumes, enough that the manager's queue never empties. > This in turn causes the manager to stop checking the file-system for > changes > > *A brief summary of the fix* > The changes are confined to the manager.py file. What I have done is: > 1. split the manager's queue into two (a standard and a priority queue) > 2. track processing of the dags from disk independently of the queue, so > that we'll rescan even if the queue is not empty. > 3. added a config flag that causes the manager to scan stale dags on disk, > even if there are a a lot of priority callbacks. > > This means that SLA callbacks are processed and alerts are raised in a > timely fashion, but we continue to periodically scan the file system for > files and process any changes. > > *Other notes* > 1. First and foremost, if you're interested, please do have a look at the > PR. I have done my best to document it thoroughly. There are new tests too! > 2. The goal here is simply to make it so that adding SLAs doesn't kill the > rest of the system. I haven't changed how they're defined, how the system > raises them, or anything else. It's purely a fix to the queue(s) inside the > manager. It's as low touch as I could make it. > 3. I do have a *much* simpler fix (one line change), which works, but > isn't perfect, particularly under certain config settings. This change is > more complicated, but I think solves the problem "properly". > > That's it. Thanks for reading! > > A > >
