GitHub user Asquator edited a discussion: Redesign the scheduler logic to avoid 
starvation due to dropped tasks in critical section

The way the critical section works now is as follows (a simplified code sketch follows the list):

1. Fire a `select` query and fetch at most `max_tis` task instances to schedule
2. Loop over those tasks, checking concurrency limits to find the ones eligible for scheduling
3. If at least one eligible task instance is found, exit and send the surviving tasks to the executors
4. Otherwise, update the `starved_*` filters and try again
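
A simplified, self-contained sketch of that flow, for illustration only. The data structures and names here are invented for readability (only pool limits are modelled); this is not the actual scheduler code, which works against the metadata database:

```python
from dataclasses import dataclass


@dataclass
class TI:
    task_id: str
    pool: str
    priority: int


def critical_section(ready_tis, open_slots, max_tis):
    """Return the task instances that would be sent to the executor."""
    starved_pools = set()
    while True:
        # 1. "query": take at most max_tis ready TIs, highest priority first,
        #    skipping pools already known to be starved
        window = [ti for ti in sorted(ready_tis, key=lambda t: -t.priority)
                  if ti.pool not in starved_pools][:max_tis]
        if not window:
            return []

        # 2. apply concurrency (here: pool) limits
        slots = dict(open_slots)
        eligible = []
        for ti in window:
            if slots.get(ti.pool, 0) > 0:
                slots[ti.pool] -= 1
                eligible.append(ti)

        # 3. as soon as anything is eligible, exit; every other TI in the
        #    window is silently dropped until some later scheduler loop
        if eligible:
            return eligible

        # 4. nothing was eligible: remember the saturated pools and query again
        starved_pools |= {ti.pool for ti in window}
```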

The third step can cause any number of task instances to be dropped due to concurrency limits (as long as at least one ready task is found), so only a few tasks survive each pass. Meanwhile, ready tasks queue up in the table without getting a chance to run. This can leave tasks starved for a long time in edge cases such as almost-starved prioritized pools, as pointed out here:

https://github.com/apache/airflow/issues/45636
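
Plugging made-up numbers into the sketch above shows how the window can be wasted in that kind of edge case. The pool names, slot counts and priorities below are invented purely for illustration:

```python
# Hypothetical scenario: one slot left in a prioritized pool, three
# high-priority TIs competing for it, lower-priority work pushed below
# the max_tis cut-off. Reuses TI and critical_section() from the sketch above.
ready_tis = [
    TI("crit_a", "critical_pool", priority=100),
    TI("crit_b", "critical_pool", priority=100),
    TI("crit_c", "critical_pool", priority=100),
    TI("normal_a", "default_pool", priority=10),
    TI("normal_b", "default_pool", priority=10),  # below the max_tis cut-off
    TI("normal_c", "default_pool", priority=10),  # below the max_tis cut-off
]
open_slots = {"critical_pool": 1, "default_pool": 10}

queued = critical_section(ready_tis, open_slots, max_tis=4)
print([ti.task_id for ti in queued])
# ['crit_a', 'normal_a'] -- crit_b and crit_c are dropped by the pool limit,
# and normal_b/normal_c were never even fetched. With a steady stream of
# high-priority TIs aimed at the nearly full pool, the window keeps being
# spent on tasks that get dropped, and lower-priority ready tasks can wait
# a long time before they are even looked at.
```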

We have to rethink the scheduler logic (the query, the loop, or both) to avoid this kind of starvation.


GitHub link: https://github.com/apache/airflow/discussions/49160
