leonsmith commented on issue #18501:
URL: https://github.com/apache/airflow/issues/18501#issuecomment-933175508


   So we just came across this issue.
   
   Our reproduction steps are similar too.
   
   We enabled a new daily dag with ‘catchup=True’ and a start date 3 years in 
the past. This caused the scheduler to correctly create all 1000+ dag runs and 
to start “catching up”.
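
   For a sense of scale, here is a rough sketch of the backlog size. The dates 
are made up for illustration (not our real ones), and it assumes one run per 
completed daily interval, which is how we understand catchup to work.

```python
from datetime import date


def daily_runs_to_catch_up(start: date, today: date) -> int:
    """Count the completed daily intervals between start and today."""
    return max((today - start).days, 0)


# Hypothetical dates: a start date three years before "today".
start = date(2018, 10, 1)
today = date(2021, 10, 1)

backlog = daily_runs_to_catch_up(start, today)
print(backlog)  # 1096 (the span includes the leap day in 2020)
```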
   
   However, this starved other dagruns in the environment of execution until 
the dag was fully caught up. We also had a low max active dag runs setting, 
which helped identify the issue.
   
   The new dagruns are moving from scheduled to running correctly in the 
scheduler loop and the scheduler correctly drops dagruns that have hit the 
dagrun limit.
   
   However, what appears to be happening is that if a dag is under its max 
active dagruns limit, the dagruns-per-loop query will still return all the 
dagruns for this dag, even though it can only launch a small number more up to 
its limit. 
   This effectively limits the throughput of the scheduler loop, because a high 
percentage of the returned dagruns are not actually eligible to run and the 
last scheduled decision date is not updated.
   
   Fixing the last scheduled decision date, as already merged in #17945, does 
allow other tasks to be picked up, but only after every dagrun has been 
inspected once.
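
   A small simulation of that effect, again as a toy model rather than the 
real scheduler: the `Run` class, the ordering key and the batch size of 20 are 
assumptions for illustration. It counts how many loops pass before a run from 
another dag is first examined, with and without updating the decision marker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    dag_id: str
    seq: int  # stand-in for the execution date ordering
    last_decision: Optional[int] = None  # loop in which the run was last examined


def next_batch(runs: List[Run], limit: int) -> List[Run]:
    # Never-examined runs first, then least recently examined, then oldest.
    key = lambda r: (r.last_decision is not None, r.last_decision or 0, r.seq)
    return sorted(runs, key=key)[:limit]


def loops_until_examined(runs, target_dag, limit, update_decision, max_loops=200):
    """Return the loop in which target_dag is first examined, or None."""
    for loop in range(1, max_loops + 1):
        batch = next_batch(runs, limit)
        if any(r.dag_id == target_dag for r in batch):
            return loop
        if update_decision:
            for r in batch:
                r.last_decision = loop
    return None


def make_runs() -> List[Run]:
    # 1000 backlog runs for one dag, plus one newer run for another dag.
    return [Run("backfill", i) for i in range(1000)] + [Run("other", 1000)]


without_fix = loops_until_examined(make_runs(), "other", 20, update_decision=False)
with_fix = loops_until_examined(make_runs(), "other", 20, update_decision=True)
print(without_fix, with_fix)  # None 51
```

   Without the update the same 20 runs come back every loop, so the other 
dag is never reached. With it, the other dag is reached only on loop 51, after 
all 1000 backlog runs have been looked at once.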
   
   Thus this is actually a compound issue: once the above is fixed, it turns 
into slower dag execution rather than a stalling/blocking issue.
   
   The above fix also has the side effect of running the dagruns out of order, 
but I don’t think it was ever a guarantee that new dagruns would be executed in 
order unless they are marked to depend on the past.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

