argibbs commented on PR #30076: URL: https://github.com/apache/airflow/pull/30076#issuecomment-1467768971
> Why were the DAGs added back to the queue previously? I wonder if this would break something else. The queue is just a straight up list of "these dags need to be evaluated". When a dag is picked off the queue and processed, any callbacks that are currently waiting for it are processed at the same time. I've not gone on a code archaeology hunt, but if I had to guess, callbacks were added later on. If you don't have callbacks, then the logic is dead simple: fetch all dag files on disk, work your way through them, **wait until the queue is empty** and `min_process_interval` expires, rinse and repeat. As far as the queue is concerned, callbacks are simply a way to jump the line. A dag with a callback is added to the queue immediately, even if it was only just processed (or in fact, even if it's _currently_ being processed). The problem is that because the vanilla logic requires that the queue empties before it will refill it with all the dags on disk, if you are generating a lot of callbacks and constantly topping up the queue, you never refill it with all the dags, and only the ones that are generating callbacks ever get processed. And that's the problem with SLA callbacks - they generate _so many updates_ that the queue never drains. There are several other way that you could try to fix this: 1. Reduce the frequency of SLA callbacks - it's definitely the case that they are generated too aggressively, but given the nature of the problem I think you would simply be reducing the likelihood that you DoS the queue, not eliminating it altogether. 2. Change the queue logic to ensure that all files are periodically added back to the queue, even if it's not empty yet. This was actually in the previous PR I raised, but it seemed clear to me that there was (is?) no appetite for a more involved change in this area of the code (which I 100% understand and respect). Hence my second attempt to fix the problem with this change. It's super minimal - the only affected codepath is explicitly SLAs. It definitely stops SLA's breaking the system. And SLA's still fire (just not as quickly), because you have to wait ~30 seconds for the vanilla queue logic to drain and refill. Could the logic be changed to let you have your cake and eat it - fast SLA callbacks and a working queue? Absolutely, but that's a bigger change. From a usability stand point, I'm firmly in the "build it and they will come" camp. SLAs are rarely used because they break the system. Having had working SLAs in our system for months, I think they're awesome. Buggy yes, but once they fire reliably, I'm willing to bet they'll become more popular and there'll be more appetite for improving them. And if I'm wrong, well, at least I don't have to keep patching our install on every upgrade. :laughing: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
