[GitHub] [airflow] argibbs commented on pull request #30076: SLA callbacks no longer add files to the dag_processing manager queue

via GitHub Tue, 14 Mar 2023 02:56:29 -0700


argibbs commented on PR #30076:
URL: https://github.com/apache/airflow/pull/30076#issuecomment-1467768971


   > Why were the DAGs added back to the queue previously? I wonder if this 
would break something else.
   
   The queue is just a straight up list of "these dags need to be evaluated". 
When a dag is picked off the queue and processed, any callbacks that are 
currently waiting for it are processed at the same time.
   
   I've not gone on a code archaeology hunt, but if I had to guess, callbacks 
were added later on. If you don't have callbacks, then the logic is dead 
simple: fetch all dag files on disk, work your way through them, **wait until 
the queue is empty** and `min_process_interval` expires, rinse and repeat.
   
   As far as the queue is concerned, callbacks are simply a way to jump the 
line. A dag with a callback is added to the queue immediately, even if it was 
only just processed (or in fact, even if it's _currently_ being processed). The 
problem is that because the vanilla logic requires that the queue empties 
before it will refill it with all the dags on disk, if you are generating a lot 
of callbacks and constantly topping up the queue, you never refill it with all 
the dags, and only the ones that are generating callbacks ever get processed. 
And that's the problem with SLA callbacks - they generate _so many updates_ 
that the queue never drains.
   
   There are several other way that you could try to fix this:
   1. Reduce the frequency of SLA callbacks - it's definitely the case that 
they are generated too aggressively, but given the nature of the problem I 
think you would simply be reducing the likelihood that you DoS the queue, not 
eliminating it altogether. 
   2. Change the queue logic to ensure that all files are periodically added 
back to the queue, even if it's not empty yet. This was actually in the 
previous PR I raised, but it seemed clear to me that there was (is?) no 
appetite for a more involved change in this area of the code (which I 100% 
understand and respect).
   
   Hence my second attempt to fix the problem with this change. It's super 
minimal - the only affected codepath is explicitly SLAs. It definitely stops 
SLA's breaking the system. And SLA's still fire (just not as quickly), because 
you have to wait ~30 seconds for the vanilla queue logic to drain and refill. 
Could the logic be changed to let you have your cake and eat it - fast SLA 
callbacks and a working queue? Absolutely, but that's a bigger change. 
   
   From a usability stand point, I'm firmly in the "build it and they will 
come" camp. SLAs are rarely used because they break the system. Having had 
working SLAs in our system for months, I think they're awesome. Buggy yes, but 
once they fire reliably, I'm willing to bet they'll become more popular and 
there'll be more appetite for improving them. And if I'm wrong, well, at least 
I don't have to keep patching our install on every upgrade. :laughing: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] argibbs commented on pull request #30076: SLA callbacks no longer add files to the dag_processing manager queue

Reply via email to