Hi everyone,

First time post to the dev list ... please be gentle!

I raised a PR to fix SLA alerts (
https://github.com/apache/airflow/pull/25489), but it's not trivial. Jarek
Potiuk asked me to flag it up here, where it might get more attention, and
I'm happy to oblige.


*A brief summary of the problem*
To the end user, adding SLAs means that Airflow stops processing changes
to your dag files.

*A brief summary of the cause*
Adding an SLA to a task in a dag means SLA callbacks get raised and passed
back to the DagProcessorManager. The callbacks can be created in
relatively large volumes, enough that the manager's queue never empties.
This in turn causes the manager to stop checking the file system for
changes.
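To make the starvation concrete, here's a minimal toy sketch (not Airflow's actual code, just an illustration of the mechanism): if the manager only rescans the dags folder when its queue is empty, and callbacks arrive at least as fast as they're drained, the rescan never runs.

```python
from collections import deque

# Toy model of the starvation: the manager rescans the dags folder
# only when its callback queue is empty. SLA callbacks arrive faster
# than they drain, so the queue never empties and no rescan happens.
callback_queue = deque()
rescans = 0

for tick in range(100):
    # Two new SLA callbacks arrive every tick...
    callback_queue.extend(["sla_callback"] * 2)
    # ...but only one is drained per tick, so the queue only grows.
    callback_queue.popleft()
    # A rescan would only happen on an empty queue; it never does here.
    if not callback_queue:
        rescans += 1

print(rescans)  # 0 -- the file scan is starved indefinitely
```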

*A brief summary of the fix*
The changes are confined to the manager.py file. What I have done is:
1. split the manager's queue into two (a standard and a priority queue)
2. tracked processing of the dags from disk independently of the queues, so
that we'll rescan even if a queue is not empty
3. added a config flag that causes the manager to scan stale dags on disk,
even if there are a lot of priority callbacks

This means that SLA callbacks are processed and alerts are raised in a
timely fashion, but we continue to periodically scan the file system for
files and process any changes.
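The shape of the approach can be sketched roughly like this. To be clear, this is a hypothetical illustration of the idea, not the PR's actual code; the names (ManagerSketch, scan_interval, tick) are mine, not Airflow's:

```python
from collections import deque

# Hypothetical sketch of the approach: a standard and a priority queue,
# plus a scan counter tracked independently of both, so stale dag files
# get rescanned even while priority callbacks keep arriving.
class ManagerSketch:
    def __init__(self, scan_interval=3):
        self.priority = deque()   # e.g. SLA callbacks
        self.standard = deque()   # e.g. ordinary parse requests
        self.scan_interval = scan_interval  # ticks between forced scans
        self._ticks_since_scan = 0
        self.scans = 0
        self.callbacks_done = 0

    def tick(self):
        self._ticks_since_scan += 1
        # Scanning is independent of queue depth: force a rescan once
        # the interval elapses, even with a full priority queue.
        if self._ticks_since_scan >= self.scan_interval:
            self.scans += 1
            self._ticks_since_scan = 0
            return
        # Otherwise drain priority work first, then standard work.
        if self.priority:
            self.priority.popleft()
            self.callbacks_done += 1
        elif self.standard:
            self.standard.popleft()
            self.callbacks_done += 1

m = ManagerSketch(scan_interval=3)
for _ in range(9):
    m.priority.append("sla_callback")  # callbacks never stop arriving
    m.tick()
print(m.scans, m.callbacks_done)  # 3 6 -- scans still happen
```

The point of the sketch: because the rescan deadline is tracked outside the queues, a steady flood of priority callbacks slows callback processing down a little but can no longer starve the file system scan entirely.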

*Other notes*
1. First and foremost, if you're interested, please do have a look at the
PR. I have done my best to document it thoroughly. There are new tests too!
2. The goal here is simply to make it so that adding SLAs doesn't kill the
rest of the system. I haven't changed how they're defined, how the system
raises them, or anything else. It's purely a fix to the queue(s) inside the
manager. It's as low touch as I could make it.
3. I do have a *much* simpler fix (one line change), which works, but isn't
perfect, particularly under certain config settings. This change is more
complicated, but I think solves the problem "properly".

That's it. Thanks for reading!

A
