Hi everyone, first-time post to the dev list ... please be gentle!
I raised a PR to fix SLA alerts (https://github.com/apache/airflow/pull/25489), but it's not trivial. Jarek Potiuk asked that I flag it up here, where it might get more attention, and I'm happy to oblige.

*A brief summary of the problem*

To the end user, adding SLAs means that your system stops processing changes to the dag files.

*A brief summary of the cause*

Adding an SLA to a task in a dag means SLA callbacks get raised and passed back to the DagProcessorManager. The callbacks can be created in relatively large volumes, enough that the manager's queue never empties. This in turn causes the manager to stop checking the file system for changes.

*A brief summary of the fix*

The changes are confined to the manager.py file. What I have done is:

1. Split the manager's queue into two (a standard queue and a priority queue).
2. Track the processing of the dags from disk independently of the queue, so that we'll rescan even if the queue is not empty.
3. Added a config flag that causes the manager to scan stale dags on disk, even if there are a lot of priority callbacks.

This means that SLA callbacks are processed and alerts are raised in a timely fashion, but we continue to periodically scan the file system for files and process any changes.

*Other notes*

1. First and foremost, if you're interested, please do have a look at the PR. I have done my best to document it thoroughly. There are new tests too!
2. The goal here is simply to make it so that adding SLAs doesn't kill the rest of the system. I haven't changed how SLAs are defined, how the system raises them, or anything else. It's purely a fix to the queue(s) inside the manager, and it's as low-touch as I could make it.
3. I do have a *much* simpler fix (a one-line change), which works but isn't perfect, particularly under certain config settings. This change is more complicated, but I think it solves the problem "properly".

That's it. Thanks for reading!

A
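P.S. For anyone who wants the gist of the queue change without reading the PR, here is a minimal sketch of the idea: two queues, plus a guarantee that a disk rescan happens when one is overdue, even while priority callbacks keep arriving. All the names below (TwoQueueManager, step, scan_interval, etc.) are hypothetical, for illustration only; they are not the actual internals of manager.py.

```python
import collections
import time

class TwoQueueManager:
    """Illustrative sketch only, not Airflow's real DagProcessorManager."""

    def __init__(self, scan_interval=30.0):
        self.priority_queue = collections.deque()  # e.g. SLA callbacks
        self.standard_queue = collections.deque()  # regular file-parse requests
        self.scan_interval = scan_interval         # max seconds between disk scans
        self._last_scan = 0.0

    def submit_callback(self, cb):
        self.priority_queue.append(cb)

    def submit_file(self, path):
        self.standard_queue.append(path)

    def _scan_due(self, now):
        return now - self._last_scan >= self.scan_interval

    def step(self, now=None):
        """Pick the next unit of work, forcing a disk scan when overdue.

        The key point: the scan check comes *before* the priority queue,
        so a flood of callbacks can never starve file-system rescans.
        """
        now = time.monotonic() if now is None else now
        if self._scan_due(now):
            self._last_scan = now
            return ("scan", None)
        if self.priority_queue:
            return ("callback", self.priority_queue.popleft())
        if self.standard_queue:
            return ("parse", self.standard_queue.popleft())
        return ("idle", None)
```

Even if callbacks are submitted faster than they are drained, every `scan_interval` seconds a `("scan", None)` step is emitted, which is the behaviour the fix is after.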
