dstandish commented on issue #49508:
URL: https://github.com/apache/airflow/issues/49508#issuecomment-2945364573
Yeah Collin, I think we need clarification on the repro scenario / under what
conditions starvation occurs.
Let me go through your post carefully.
> Two DAGs each receive a large batch of DAG Runs. The number of runs for
each DAG exceeds max_dagruns_per_loop_to_schedule.
What state are the "received" runs in? I assume you mean they were triggered
via the API?
> Each DAG run is very short, shorter than the heartrate of this Airflow
deployment.
What is the significance of the runs being very short? What do you think
that has to do with this?
> Both DAGs have a max_active_runs that is far less than dagruns_per_loop.
Why? So that we can be confident that the scheduler _should_ fetch some of
these runs in the query?
> So: max_active_runs < max_dagruns_per_loop_to_schedule < number of queued
DAG runs.
> Each scheduler loop, there are a very small number of DAG Run "slots" for
> the first DAG, so the check `coalesce(running_drs.c.num_running, text("0")) <
> coalesce(Backfill.max_active_runs, DagModel.max_active_runs)` does not apply.
> But then all the DAG runs that are considered are from the first DAG. So Second
> DAG effectively has to wait for nearly all of First DAG's runs to complete
> before any of its runs are moved from queued to running.
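For concreteness, here's a toy simulation of the starvation you're describing. This is not Airflow's actual scheduler code; the names and values (`MAX_ACTIVE_RUNS`, `MAX_DAGRUNS_PER_LOOP`, the two DAG ids) are hypothetical, and it only models the claim that the per-loop window can be filled entirely by one DAG's queued runs:

```python
from collections import deque

# Hypothetical values for illustration only; not Airflow config names.
MAX_ACTIVE_RUNS = 2
MAX_DAGRUNS_PER_LOOP = 20

# 100 queued runs per DAG, with all of dag_a's runs ahead of dag_b's.
queued = deque([("dag_a", i) for i in range(100)] +
               [("dag_b", i) for i in range(100)])
running = {"dag_a": 0, "dag_b": 0}

def scheduler_loop():
    """Look at up to MAX_DAGRUNS_PER_LOOP queued runs (in queue order) and
    start each one whose DAG is still under MAX_ACTIVE_RUNS."""
    started = []
    for dag_id, run_id in list(queued)[:MAX_DAGRUNS_PER_LOOP]:
        if running[dag_id] < MAX_ACTIVE_RUNS:
            running[dag_id] += 1
            queued.remove((dag_id, run_id))
            started.append((dag_id, run_id))
    return started

started = scheduler_loop()
# All 20 runs considered this loop belong to dag_a, so dag_b starts
# nothing even though it has free slots under its max_active_runs.
print(started)
```

If that toy model matches what the real query does, dag_b would indeed wait until dag_a's backlog shrinks below the per-loop window, which is the starvation claim we'd want to verify against the actual scheduler behavior.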
Could it be that what you were observing had to do with the last scheduling
decision not getting updated because tasks were backed up somehow?
I guess either way, this seems like a very rare edge case and, since two
people have tried and failed to reproduce it, I'm not sure it's worth
continuing to try without new information.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]