GitHub user Asquator edited a discussion: Redesign the scheduler logic to avoid 
starvation due to dropped tasks in critical section

The way the critical section works now is as follows (a simplified code sketch follows the list):

1. Fire a `select` query and fetch at most `max_tis` task instances to schedule
2. Loop over those tasks, checking concurrency limits to find the ones eligible for scheduling
3. If at least one eligible task instance is found, exit and send the surviving tasks to the executors
4. Otherwise, update the `starved_*` filters and try again
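
A simplified, self-contained sketch of that flow, for illustration only. The data structures and names here are invented for readability (only pool limits are modelled); this is not the actual scheduler code, which works against the metadata database:

```python
from dataclasses import dataclass


@dataclass
class TI:
    task_id: str
    pool: str
    priority: int


def critical_section(ready_tis, open_slots, max_tis):
    """Return the task instances that would be sent to the executor."""
    starved_pools = set()
    while True:
        # 1. "query": take at most max_tis ready TIs, highest priority first,
        #    skipping pools already known to be starved
        window = [ti for ti in sorted(ready_tis, key=lambda t: -t.priority)
                  if ti.pool not in starved_pools][:max_tis]
        if not window:
            return []

        # 2. apply concurrency (here: pool) limits
        slots = dict(open_slots)
        eligible = []
        for ti in window:
            if slots.get(ti.pool, 0) > 0:
                slots[ti.pool] -= 1
                eligible.append(ti)

        # 3. as soon as anything is eligible, exit; every other TI in the
        #    window is silently dropped until some later scheduler loop
        if eligible:
            return eligible

        # 4. nothing was eligible: remember the saturated pools and query again
        starved_pools |= {ti.pool for ti in window}
```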

The third step can cause any number of task instances to be dropped due to concurrency limits (as long as at least one ready task is found), so only a few tasks survive each pass. Meanwhile, ready tasks queue up in the table without getting a chance to run. This can leave tasks starved for a long time in edge cases such as almost-starved prioritized pools, as pointed out here:

https://github.com/apache/airflow/issues/45636
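
Plugging made-up numbers into the sketch above shows how the window can be wasted in that kind of edge case. The pool names, slot counts and priorities below are invented purely for illustration:

```python
# Hypothetical scenario: one slot left in a prioritized pool, three
# high-priority TIs competing for it, lower-priority work pushed below
# the max_tis cut-off. Reuses TI and critical_section() from the sketch above.
ready_tis = [
    TI("crit_a", "critical_pool", priority=100),
    TI("crit_b", "critical_pool", priority=100),
    TI("crit_c", "critical_pool", priority=100),
    TI("normal_a", "default_pool", priority=10),
    TI("normal_b", "default_pool", priority=10),  # below the max_tis cut-off
    TI("normal_c", "default_pool", priority=10),  # below the max_tis cut-off
]
open_slots = {"critical_pool": 1, "default_pool": 10}

queued = critical_section(ready_tis, open_slots, max_tis=4)
print([ti.task_id for ti in queued])
# ['crit_a', 'normal_a'] -- crit_b and crit_c are dropped by the pool limit,
# and normal_b/normal_c were never even fetched. With a steady stream of
# high-priority TIs aimed at the nearly full pool, the window keeps being
# spent on tasks that get dropped, and lower-priority ready tasks can wait
# a long time before they are even looked at.
```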

We have to rethink the scheduler logic (the query, the loop, or both) to avoid this kind of starvation.


GitHub link: https://github.com/apache/airflow/discussions/49160
