argibbs opened a new pull request, #25147: URL: https://github.com/apache/airflow/pull/25147
# Problem

When SLAs are enabled, DAG processing can grind to a halt. This manifests as updates to dag files being ignored: newly added dags do not appear, and changes to existing dags do not take effect. The behaviour is seemingly random: some dags update, others do not.

The reason is that the DagProcessorManager internally maintains a queue of dags to parse and update. Dags get into this queue by a couple of mechanisms. The first and most obvious one is the scan of the file system for dag files. However, dags can also get into this queue when the scheduler (`scheduler_job.py`) evaluates a dag's state. Since these event-based callbacks presumably require a more rapid reaction than a regular scan of the file system, they go to the front of the queue.

And this is how SLAs break the system: prior to this PR they are treated the same as other callbacks, i.e. they cause their file to go to the front of the queue. The problem is that SLAs are evaluated per dag file, but a single dag may have many tasks with SLAs. Thus the evaluation of a single DAG may generate _many_ SLA callbacks. These cause the affected file to go to the front of the queue. It is re-evaluated, and then the SLA events are fired again.

What this means in practice is that you will see the DagProcessorManager process a dag file with SLAs, move on to the next file in the queue, maybe even make it to 2 or 3 more dags ... and then more SLA callbacks arrive from the first dag and reset the queue. The DagProcessorManager never makes it all the way to the end of its queue.

# Solution

It's pretty simple: the DagProcessorManager queue is altered so that SLA callbacks are added only if they don't already exist (remember they're processed per dag, but generated one per task-with-SLA), and adding one does not change the place of the dag file in the queue. If the file is not in the queue, it's added at the back.
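The queueing rule described in the solution can be sketched with a toy model. This is only an illustration under stated assumptions, not the code in this PR: `FileQueue`, `add_callback`, and the `is_sla` flag are hypothetical names, and callbacks are represented as plain strings.

```python
from collections import deque


class FileQueue:
    """Toy model of a dag-file parsing queue (hypothetical, not Airflow's class)."""

    def __init__(self, paths):
        self.queue = deque(paths)   # files waiting to be parsed, front first
        self.callbacks = {}         # path -> list of pending callbacks

    def add_callback(self, path, callback, is_sla=False):
        pending = self.callbacks.setdefault(path, [])
        if is_sla:
            # SLAs are processed per dag, so a duplicate request adds nothing.
            if callback not in pending:
                pending.append(callback)
            # Keep the file's current position; append only if it is not queued.
            if path not in self.queue:
                self.queue.append(path)
        else:
            pending.append(callback)
            # Other callbacks still move their file to the front of the queue.
            if path in self.queue:
                self.queue.remove(path)
            self.queue.appendleft(path)
```

In this model, repeated SLA callbacks for a file that is already queued leave the order untouched, so the files ahead of it are still reached, while a non-SLA callback continues to jump the queue.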
# Notes

This may feel a bit sticky-tape-and-string; you could argue that the SLA callbacks shouldn't be generated so rapidly. However, the only component that knows the state of the queue is the DagProcessorManager, and it's unreasonable to expect the DagProcessors to throttle themselves without knowing whether such throttling is necessary. To put it another way, more optimisations in the DagProcessors are possible, but having the queue gate the callbacks as they're added is necessary and sufficient to stop the SLAs spamming the queue.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
