argibbs opened a new pull request, #30076: URL: https://github.com/apache/airflow/pull/30076
# Let's make SLA's work OK, so https://github.com/apache/airflow/pull/25147 made a start in this direction. Summing up the, er, summary from that MR, the problem was that SLA callbacks could keep occuring, and prevent the dag processor manager from ever processing more than 2 or 3 dags in the queue before the SLA callbacks re-upped and went to the front of the queue. After that PR was merged, the SLA callbacks went to the back of the queue. This guaranteed that the queue would be processed at least once. However, it turns out that dags on disk would only be re-added to the queue once the queue was empty. But with SLA callbacks arriving All. The. Time. the queue would never drain, and we'd never re-read dags from disk. So if you updated the dag file, you'd have to bounce the scheduler to pick up the change, and then it would process all non-SLA-generating DAGs exactly once. And then you'd need to bounce again. ### Related Issues Closes https://github.com/apache/airflow/issues/15596 (I hope!) # Pay attention, here comes the science bit Ok, so to briefly recap the queue behaviour prior to this change: 1. The scheduler spins up. 2. The worker loop starts. 3. It scans the disk to create a list of all known files. _It does not add these files to the queue_. 4. If the queue is empty (on start up it will be) then all known files are added to the queue. 5. It then spins up dag processors (via multiprocessing or in-line, depending on set up) to consume the first N entries in the queue (where N is a function of config settings, and includes a count of any files currently being processed from a previous pass of the loop) 6. It then goes back to step 2. - Note that Step 3 is re-run periodically based on a time interval set in config (`dag_dir_list_interval`). It does not require that the queue be empty. Any changes to the set of files on disk e.g. dags being deleted will cause the manager to remove dags from the queue. - After a few iterations of the loop, the contents of the queue are processed, and step 4 kicks in, and the dag files present on disk are re-added to the queue again. - The default config is to do this no more than once every 30 seconds (`min_file_process_interval`), so if your dag files only take 10 seconds to process, there will be 20 seconds of idle time, but if your dag files take a minute to process, then the manager will be permanently busy, because as soon as the queue drains, it'll be well past time to reload the queue from disk. - This only works if the queue _actually empties_ - as foreshadowed above, if the queue never empties (because SLAs) then step 4 will never kick in, and we'll never reload the queue. Step 3 will ensure that within a short interval any deleted dags are removed from the queue (should they happen to be one of the ones generating SLAs) but updated and newly added dag files will not be picked up. - To be clear, this is the problem I alluded to in my comments at the end of the previous PR after it got merged. # What this PR changes The fix is extremely simple - although the SLA callback is registered with the manager, simply getting an SLA callback no longer causes the dag to be added to the queue. This ensures that the queue will periodically empty, and thus all existing dag files will be re-added on a regular basis. If you update a dag file, the change will be picked up within `min_file_process_interval`. The trade-off is that the SLA callback will not be processed until its dag file is processed as part of the periodic reload of all dag files onto the queue. I have also added some comments and some extra metrics which I find useful, but the only functional change is the one above. # Seems familiar? I actually tried to make a change in this area last year (see #25489 / #27317). The change in that PR was a lot more involved as it tried to close several possible edge cases. While I was confident in that change (I have been running it in prod for months using a locally patched copy), I can understand the need for caution. This change is less invasive. If you don't use SLA's there's no impact. If you _do_ use SLAs and sometimes find the system doesn't pick up changes to your dags ... well, this fixes that. # Loose ends The previous PR was more invasive, but closed a few additional edge cases: 1. Large numbers of non-SLA callbacks having the same effect and DoS'ing the queue: In my locally patched copy, this _never_ happens. Maybe other people will have different experiences in the wild, but (by virtue of having the additional queue metrics) I was able to see that the numbers were never anything like as big as the SLA callbacks 2. Increased time to SLA alert: Yes, this does mean that a SLA callback will not be picked up immediately - you'll have to wait until all the dag files have been processed and the queue is refreshed; this will either be `min_file_process_interval` or the time needed to process all your dag files, whichever is longer. But by default `min_file_process_interval` is 30 seconds, which is still pretty quick. I'm happy with the above trade-offs, given that this change is so small vs the impact it has (finally fixes SLAs!) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
