argibbs opened a new pull request, #30076:
URL: https://github.com/apache/airflow/pull/30076

# Let's make SLAs work
   
   OK, so https://github.com/apache/airflow/pull/25147 made a start in this 
direction. Summing up the, er, summary from that PR: the problem was that SLA 
callbacks could keep occurring, and prevent the dag processor manager from ever 
processing more than 2 or 3 dags in the queue before the SLA callbacks re-upped 
and went to the front of the queue.
   
   After that PR was merged, the SLA callbacks went to the back of the queue. 
This guaranteed that the queue would be processed at least once. However, it 
turns out that dags on disk would only be re-added to the queue once the queue 
was empty. But with SLA callbacks arriving All. The. Time. the queue would 
never drain, and we'd never re-read dags from disk. So if you updated the dag 
file, you'd have to bounce the scheduler to pick up the change, and then it 
would process all non-SLA-generating DAGs exactly once. And then you'd need to 
bounce again.
   
   ### Related Issues
   
   Closes https://github.com/apache/airflow/issues/15596 (I hope!)
   
   # Pay attention, here comes the science bit
   
   Ok, so to briefly recap the queue behaviour prior to this change:
   1. The scheduler spins up.
   2. The worker loop starts.
   3. It scans the disk to create a list of all known files. _It does not add 
these files to the queue_.
   4. If the queue is empty (on start up it will be) then all known files are 
added to the queue.
   5. It then spins up dag processors (via multiprocessing or in-line, 
depending on setup) to consume the first N entries in the queue (where N is a 
function of config settings, and includes a count of any files currently being 
processed from a previous pass of the loop).
   6. It then goes back to step 2.
   
   - Note that step 3 is re-run periodically based on a time interval set in 
config (`dag_dir_list_interval`). It does not require that the queue be empty. 
Any changes to the set of files on disk (e.g. dags being deleted) will cause 
the manager to remove those dags from the queue.
   - After a few iterations of the loop, the contents of the queue have been 
processed; step 4 then kicks in, and the dag files present on disk are re-added 
to the queue.
       - The default config is to do this no more than once every 30 seconds 
(`min_file_process_interval`), so if your dag files only take 10 seconds to 
process, there will be 20 seconds of idle time, but if your dag files take a 
minute to process, then the manager will be permanently busy, because as soon 
as the queue drains, it'll be well past time to reload the queue from disk.
       - This only works if the queue _actually empties_ - as foreshadowed 
above, if the queue never empties (because SLAs) then step 4 will never kick 
in, and we'll never reload the queue. Step 3 will ensure that within a short 
interval any deleted dags are removed from the queue (should they happen to be 
one of the ones generating SLAs) but updated and newly added dag files will not 
be picked up.
       - To be clear, this is the problem I alluded to in my comments at the 
end of the previous PR after it got merged.
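   
   To make the failure mode concrete, here's a rough Python sketch of the loop described above. It is not the actual Airflow code — the names (`ManagerState`, `tick`, `scan_dag_dir`) and the simplified state handling are my own — but it captures why a never-empty queue blocks the refill in step 4:
   
   ```python
   import time
   
   # Rough sketch only -- names and structure are assumptions, not Airflow's.
   DAG_DIR_LIST_INTERVAL = 300     # step 3 rescan interval (dag_dir_list_interval)
   MIN_FILE_PROCESS_INTERVAL = 30  # step 4 minimum refill interval
   
   class ManagerState:
       def __init__(self):
           self.known_files = []
           self.queue = []
           self.last_scan = float("-inf")
           self.last_refill = float("-inf")
   
   def tick(state, scan_dag_dir, max_parallel, now=None):
       """One pass of the worker loop (steps 3-5); returns the batch to process."""
       if now is None:
           now = time.monotonic()
       # Step 3: rescan disk on an interval; deleted files drop out of the queue.
       if now - state.last_scan >= DAG_DIR_LIST_INTERVAL:
           state.known_files = scan_dag_dir()
           state.queue = [f for f in state.queue if f in state.known_files]
           state.last_scan = now
       # Step 4: only an *empty* queue triggers a refill -- so if callbacks keep
       # the queue non-empty, files on disk are never re-added.
       if not state.queue and now - state.last_refill >= MIN_FILE_PROCESS_INTERVAL:
           state.queue = list(state.known_files)
           state.last_refill = now
       # Step 5: hand the first N entries to dag processors.
       batch, state.queue = state.queue[:max_parallel], state.queue[max_parallel:]
       return batch
   ```
   
   With SLA callbacks constantly appending to `state.queue`, the `not state.queue` check in step 4 never passes, which is exactly the stuck behaviour described above.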
   
   # What this PR changes
   
   The fix is extremely simple - although the SLA callback is registered with 
the manager, simply getting an SLA callback no longer causes the dag to be 
added to the queue. This ensures that the queue will periodically empty, and 
thus all existing dag files will be re-added on a regular basis. If you update 
a dag file, the change will be picked up within `min_file_process_interval`.
   
   The trade-off is that the SLA callback will not be processed until its dag 
file is processed as part of the periodic reload of all dag files onto the 
queue.
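   
   As a hedged sketch of the shape of the change (these names are assumptions, not the actual Airflow implementation): the SLA callback is still recorded against its dag file, but the enqueue that other callbacks get is skipped.
   
   ```python
   from collections import defaultdict
   
   # Illustrative only -- assumed names, not the real manager internals.
   class CallbackStore:
       def __init__(self):
           self.callbacks = defaultdict(list)  # dag file -> pending callbacks
           self.queue = []                     # file-parsing queue
   
       def register(self, dag_file, callback, is_sla):
           # The callback is always recorded; it runs whenever the file is parsed.
           self.callbacks[dag_file].append(callback)
           # Non-SLA callbacks still push the file onto the queue immediately.
           if not is_sla and dag_file not in self.queue:
               self.queue.append(dag_file)
           # SLA callbacks wait for the periodic full reload instead, so the
           # queue can actually drain and the refill step can kick in.
   ```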
   
   I have also added some comments and some extra metrics which I find useful, 
but the only functional change is the one above.
   
   # Seem familiar?
   
   I actually tried to make a change in this area last year (see #25489 / 
#27317). The change in that PR was a lot more involved as it tried to close 
several possible edge cases. While I was confident in that change (I have been 
running it in prod for months using a locally patched copy), I can understand 
the need for caution.
   
   This change is less invasive. If you don't use SLAs, there's no impact. If 
you _do_ use SLAs and sometimes find the system doesn't pick up changes to your 
dags ... well, this fixes that.
   
   # Loose ends
   
   The previous PR was more invasive, but closed a few additional edge cases:
   1. Large numbers of non-SLA callbacks having the same effect and DoS'ing the 
queue:  
   In my locally patched copy, this _never_ happens. Maybe other people will 
have different experiences in the wild, but (by virtue of having the additional 
queue metrics) I was able to see that the numbers were never anything like as 
big as the SLA callback numbers.
   2. Increased time to SLA alert:  
   Yes, this does mean that an SLA callback will not be picked up immediately - 
you'll have to wait until all the dag files have been processed and the queue 
is refreshed; this will be either `min_file_process_interval` or the time 
needed to process all your dag files, whichever is longer. But by default 
`min_file_process_interval` is 30 seconds, which is still pretty quick.
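   
   That worst case can be written down as a one-liner (my own back-of-envelope helper, not part of the PR, and it ignores where the file lands within a pass):
   
   ```python
   # Assumed helper: worst-case wait before an SLA callback runs under this
   # change. The queue refills at most once per min_file_process_interval,
   # and a refill can't start until the previous full pass has finished.
   def worst_case_sla_delay(min_file_process_interval, total_parse_time):
       return max(min_file_process_interval, total_parse_time)
   ```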
   
   I'm happy with the above trade-offs, given how small this change is versus 
the impact it has (it finally fixes SLAs!).


-- 
This is an automated message from the Apache Git Service.