GitHub user Asquator edited a discussion: Redesign the scheduler logic to avoid starvation due to dropped tasks in critical section
The way the critical section works now is:

1. Fire a `select` query and get at most `max_tis` task instances to schedule.
2. Loop over the tasks to check concurrency limits and find tasks eligible for scheduling.
3. If at least one task instance is found, exit and send the good tasks to the executors.
4. Otherwise, update the `starved_` filters and try again.

The third step can cause any number of tasks to be dropped due to concurrency limits (as long as at least one ready task is found), and only a few tasks will survive. At the same time, ready tasks will queue up in the table without getting a chance to run. This can cause tasks to starve for a long time in edge cases such as almost-full prioritized pools, as pointed out here: https://github.com/apache/airflow/issues/45636

We have to rethink the scheduler logic (the query, or the loop altogether) to avoid this kind of starvation. A simplified sketch of the current behaviour follows.

GitHub link: https://github.com/apache/airflow/discussions/49160
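Below is a minimal, hypothetical Python sketch of the loop described above, assuming only per-pool slot limits. The names `TaskInstance`, `fetch_candidates`, and `critical_section` are illustrative stand-ins, not the actual scheduler code. It shows how the round ends as soon as a single task survives, while the rest of the batch is dropped and lower-priority ready tasks that never made it into the batch keep waiting.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    task_id: str
    pool: str
    priority: int


def fetch_candidates(all_tis, starved_pools, max_tis):
    """Step 1: select at most `max_tis` candidates, skipping starved pools,
    ordered by priority (highest first)."""
    eligible = [ti for ti in all_tis if ti.pool not in starved_pools]
    return sorted(eligible, key=lambda ti: -ti.priority)[:max_tis]


def critical_section(all_tis, pool_slots, max_tis=32):
    """Steps 2-4: loop until at least one task instance survives the
    concurrency checks, then return only the survivors."""
    starved_pools = set()
    while True:
        candidates = fetch_candidates(all_tis, starved_pools, max_tis)
        if not candidates:
            return []
        survivors = []
        for ti in candidates:
            if pool_slots.get(ti.pool, 0) > 0:
                pool_slots[ti.pool] -= 1
                survivors.append(ti)
            else:
                # Dropped because of concurrency limits; nothing is re-fetched
                # to take its place in this scheduling round.
                starved_pools.add(ti.pool)
        if survivors:
            # Step 3: exit as soon as *any* task survived, even if most of
            # the batch was dropped.
            return survivors
        # Step 4: otherwise widen the starved filters and try again.


# Example of the starvation edge case: an almost-full prioritized pool fills
# the whole batch, so tasks in "other" never even get fetched, and only one
# task is sent to the executors this round.
tis = [TaskInstance(f"hi_{i}", "prioritized", priority=10) for i in range(32)]
tis += [TaskInstance(f"lo_{i}", "other", priority=1) for i in range(8)]
print(critical_section(tis, {"prioritized": 1, "other": 8}, max_tis=32))
```

Under these assumptions, each scheduling round schedules at most one `prioritized` task and none of the ready `other` tasks, which is the starvation pattern reported in the linked issue.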