leonsmith commented on issue #18501: URL: https://github.com/apache/airflow/issues/18501#issuecomment-933175508
We just came across this issue, and our reproduction steps are similar. We enabled a new daily DAG with `catchup=True` and a start date 3 years in the past. The scheduler correctly created all 1000+ dagruns and started "catching up". However, this starved other dagruns in the environment from being executed until the DAG was fully caught up. We also had a low max active dag runs setting, which helped identify the issue.

New dagruns are moving from scheduled to running correctly in the scheduler loop, and the scheduler correctly drops dagruns that have hit the dagrun limit. What appears to be happening is that if a DAG is under its max concurrent dagrun limit, the dagruns-per-loop query still returns all the dagruns for that DAG, even though it can only launch a small number more before reaching its limit. This effectively limits the throughput of the scheduler loop, because a high percentage of the returned dagruns are not actually eligible to run and their last scheduled decision date is not updated.

Fixing the last scheduled decision date, as already merged in #17945, does allow other tasks to be picked up, but only after every dagrun has been inspected once. So this is actually a compound issue: once the above is fixed, it turns into slower DAG execution rather than a stalling/blocking issue.

The above fix also has the side effect of running the dagruns out of order, but I don't think it was ever a guarantee that new dagruns would be executed in order unless they are marked to depend on the past.
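To make the behaviour above easier to reason about, here is a rough toy model of it in plain Python. This is not Airflow's scheduler code. The `Run` class, the `simulate` and `demo` functions, the `touch_skipped` flag, the three toy states, and the assumption that every running run finishes within one loop are all made up for illustration. The only ideas taken from the description are a per-loop batch ordered by last decision date, a per-DAG cap on active runs, and whether skipped runs get their decision date updated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    dag_id: str
    index: int
    state: str = "waiting"    # toy states: waiting -> running -> done
    last_decision: int = -1   # loop at which the run was last examined (-1 = never)


def simulate(runs, max_active, batch_size, touch_skipped, max_loops=10_000):
    """Return the (loop, dag_id, index) start events in the order they happen."""
    started = []
    for loop in range(max_loops):
        # Toy assumption: every running run finishes within one loop.
        for run in runs:
            if run.state == "running":
                run.state = "done"
        waiting = [r for r in runs if r.state == "waiting"]
        if not waiting:
            break
        # Per-loop batch: oldest decision date first, capped at batch_size.
        # sorted() is stable, so ties keep creation order.
        batch = sorted(waiting, key=lambda r: r.last_decision)[:batch_size]
        active = Counter()
        for run in batch:
            if active[run.dag_id] < max_active[run.dag_id]:
                run.state = "running"
                active[run.dag_id] += 1
                run.last_decision = loop
                started.append((loop, run.dag_id, run.index))
            elif touch_skipped:
                # The "fix": record that this run was examined even though
                # it could not start, so it moves to the back of the queue.
                run.last_decision = loop
    return started


def demo(touch_skipped):
    # 100 backlog runs created before a single run of an unrelated DAG.
    runs = [Run("backlog", i) for i in range(100)] + [Run("other", 0)]
    events = simulate(runs, {"backlog": 2, "other": 1},
                      batch_size=10, touch_skipped=touch_skipped)
    other_loop = next(loop for loop, dag_id, _ in events if dag_id == "other")
    backlog_order = [i for _, dag_id, i in events if dag_id == "backlog"][:4]
    return other_loop, backlog_order


if __name__ == "__main__":
    print(demo(touch_skipped=False))  # (46, [0, 1, 2, 3])
    print(demo(touch_skipped=True))   # (10, [0, 1, 10, 11])
```

In this toy setup, without updating the decision date the unrelated DAG only starts at loop 46, once the backlog is nearly drained. With the update it starts at loop 10, right after every backlog run has been looked at once, and the backlog runs start out of creation order (0, 1, 10, 11, ...). That matches the "slower rather than blocked" and "out of order" observations, at least in this simplified model.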
