Chris Riccomini created AIRFLOW-807:
---------------------------------------
Summary: Scheduler is very slow when a .py file has many DAGs in it
Key: AIRFLOW-807
URL: https://issues.apache.org/jira/browse/AIRFLOW-807
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.8.0b2
Reporter: Chris Riccomini
Assignee: Chris Riccomini
Fix For: Airflow 1.8
While running Airflow 1.8.0b2 in production, we noticed a significant
performance issue with one of our DAGs.
The .py file (called db.py) generates a bunch of DAGs. This file was taking >
900 seconds for the scheduler to process, which was introducing significant
delays in our data pipeline.
We enabled slow_query log for MySQL, and saw that this query was taking more
than 10 seconds per DAG in the .py file:
{code:sql}
SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS
max_ti
FROM task_instance
WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success' AND
task_instance.task_id IN ('t1', 't2') GROUP BY task_instance.task_id
{code}
This query is run inside jobs.py's manage_slas method. When running an explain,
we can see that MySQL is using the wrong index for it:
{noformat}
+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
| id | select_type | table | type | possible_keys
| key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
| 1 | SIMPLE | task_instance | ref |
PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state | ti_state | 63 | const
| 81898 | Using where; Using index |
+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
{noformat}
It's using ti_state, but should be using ti_primary. We tried running
ANALYZE/OPTIMIZE on the {{task_instance}} table, but it didn't improve the
query plan or performance time.
Next, we added a hint to the SqlAlchemy query object, which improved the
performance by about 10x, dropping the db.py parsing down to 90 seconds.
I then got another 2x boost by simply aborting the manage_slas method at the
start if the DAG has no tasks SLAs in it (none of our DAGs do). This dropped
the db.py parse time to 45-50 seconds.
This JIRA is to add a short circuit in manage_slas, and a hint for MySQL in the
query.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)