[
https://issues.apache.org/jira/browse/AIRFLOW-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ry Walker updated AIRFLOW-72:
-----------------------------
Affects Version/s: (was: 1.7.1)
> Implement proper capacity scheduler
> -----------------------------------
>
> Key: AIRFLOW-72
> URL: https://issues.apache.org/jira/browse/AIRFLOW-72
> Project: Apache Airflow
> Issue Type: Improvement
> Components: pools, scheduler
> Reporter: Bolke de Bruin
> Priority: Major
> Labels: pool, queue, scheduler
> Fix For: 2.0.0
>
>
> The scheduler is supposed to maintain queues and pools according to a
> "capacity" model. However, this is currently not properly implemented, and
> as a result issues exist such as being able to oversubscribe pools, race
> conditions in queuing/dequeuing, and probably others.
> This Jira Epic tracks all issues related to pooling/queuing and the
> (tbd) roadmap to a proper capacity scheduler.
> Why queuing / scheduling is broken:
> Locking is not properly implemented, and cannot be, because the check for
> slot availability is spread throughout the scheduler, taskinstance and
> executor. This makes obtaining a slot non-atomic and results in
> oversubscription. It also leads to race conditions, such as two tasks being
> picked from the queue at the same time because the scheduler determines
> that a queued task still needs to be sent to the executor even though an
> earlier run already sent it.
> To fix this, Pool handling needs to be centralized (code-wise) and
> work with a mutex (with_for_update()) on the database records. The
> scheduler/taskinstance can then do something like:
> slot = Pool.obtain_slot(pool_id)
> Pool.release_slot(slot)
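
The obtain/release idea above can be sketched in a few lines. This is only an
illustration under stated assumptions, not Airflow's actual Pool model: the
table layout, the PoolFullError name and the create()/used() helpers are
hypothetical, and SQLite's BEGIN IMMEDIATE (which takes the write lock before
the check) stands in for the row lock that with_for_update() would request on
a database that supports SELECT ... FOR UPDATE.

```python
import sqlite3
from contextlib import contextmanager


class PoolFullError(Exception):
    """Raised when a pool has no free slot (hypothetical name)."""


class Pool:
    """Hypothetical centralized slot accounting; not Airflow's Pool model."""

    def __init__(self, db_path):
        # isolation_level=None: no implicit transactions, so BEGIN/COMMIT
        # are issued explicitly below.
        self.conn = sqlite3.connect(db_path, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pool ("
            "id TEXT PRIMARY KEY, "
            "slots INTEGER NOT NULL, "
            "used INTEGER NOT NULL DEFAULT 0)"
        )

    @contextmanager
    def _locked(self):
        # Assumption: BEGIN IMMEDIATE is the stand-in for the mutex. It takes
        # the write lock up front, so check and update form one atomic step.
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            yield
            self.conn.execute("COMMIT")
        except Exception:
            self.conn.execute("ROLLBACK")
            raise

    def create(self, pool_id, slots):
        self.conn.execute(
            "INSERT INTO pool (id, slots) VALUES (?, ?)", (pool_id, slots)
        )

    def used(self, pool_id):
        row = self.conn.execute(
            "SELECT used FROM pool WHERE id = ?", (pool_id,)
        ).fetchone()
        return row[0]

    def obtain_slot(self, pool_id):
        """Check availability and claim a slot inside a single transaction."""
        with self._locked():
            row = self.conn.execute(
                "SELECT slots, used FROM pool WHERE id = ?", (pool_id,)
            ).fetchone()
            if row is None:
                raise KeyError(pool_id)
            slots, used = row
            if used >= slots:
                raise PoolFullError(pool_id)
            self.conn.execute(
                "UPDATE pool SET used = used + 1 WHERE id = ?", (pool_id,)
            )
        # The "slot" handle here is just the pool id; a real design might
        # return a richer object.
        return pool_id

    def release_slot(self, slot):
        """Give a slot back; never lets the counter drop below zero."""
        with self._locked():
            self.conn.execute(
                "UPDATE pool SET used = used - 1 WHERE id = ? AND used > 0",
                (slot,),
            )
```

Because the availability check lives in one place and runs under the lock, a
caller either gets a slot or gets PoolFullError; there is no window between
"check" and "claim" in which a second caller could take the same slot.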