Gabriel Silk created AIRFLOW-2430:
-------------------------------------
Summary: Bad query patterns at scale prevent scheduler from
starting
Key: AIRFLOW-2430
URL: https://issues.apache.org/jira/browse/AIRFLOW-2430
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Reporter: Gabriel Silk
h2. Summary
Certain queries executed by the scheduler do not scale well with the number of
tasks being operated on. Two example functions:
* reset_state_for_orphaned_tasks
* _execute_task_instances
Concretely, with only 75k tasks being operated on, the first query can take
dozens of minutes to run, blocking the scheduler from making progress.
The cause is twofold:
1. As the query grows past a certain point, the MySQL planner will choose to do
a full table scan as opposed to using an index. I assume the same is true of
Postgres.
2. The query predicate size grows linearly in the number of tasks being
operated on, thus increasing the amount of work that needs to be done per row.
In a sense, you’re left with an operation that scales as O(n^2): a scan over
the rows, with each row checked against a predicate whose size grows with n.
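To make the scaling argument concrete, here is a rough sketch of how such a predicate grows. The key shape (dag_id, task_id, execution_date) and both function names are assumptions made for illustration; this is not the scheduler's actual query-building code.

```python
from typing import List, Tuple

# Hypothetical key shape, assumed for illustration only.
TaskKey = Tuple[str, str, str]  # (dag_id, task_id, execution_date)


def build_predicate(keys: List[TaskKey]) -> str:
    """Return an OR-of-ANDs WHERE clause with one term per task key."""
    terms = [
        "(dag_id = '{}' AND task_id = '{}' AND execution_date = '{}')".format(*key)
        for key in keys
    ]
    return " OR ".join(terms)


def worst_case_comparisons(num_rows: int, num_keys: int) -> int:
    """Under a full table scan, every row may be tested against every term."""
    return num_rows * num_keys
```

With one predicate term per task, a full scan over 75k rows against 75k terms is on the order of 75,000 * 75,000 comparisons in the worst case, which is where the quadratic behaviour comes from.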
h2. Proposed Fix
It appears that one of these bad query patterns was fixed in
[3547cbffd|https://github.com/apache/incubator-airflow/commit/3547cbffdbffac2f98a8aa05526e8c9671221025]
by introducing a configurable batch size which can be set via max_tis_per_query.
I propose we extend the suggested fix to include other poorly-performing
queries in the scheduler.
I’ve identified two queries that are directly affecting my work and included
them in the diff, though the same approach can be extended to more queries as
we see fit.
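The batching approach could look roughly like the sketch below. The helper names chunks and run_in_batches, and the run_query callback, are hypothetical and stand in for whatever the scheduler actually does per batch; only max_tis_per_query comes from the existing configuration.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    items: Sequence[T],
    max_tis_per_query: int,
    run_query: Callable[[Sequence[T]], int],
) -> int:
    """Call run_query once per batch and return the sum of its results.

    run_query is a hypothetical callback that issues one query whose
    predicate covers only the given batch and returns the rows affected.
    """
    total = 0
    for batch in chunks(items, max_tis_per_query):
        total += run_query(batch)
    return total
```

Keeping each predicate bounded by the batch size should make it more likely that the planner sticks with the index, at the cost of issuing more queries.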
Thanks!