[
https://issues.apache.org/jira/browse/AIRFLOW-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358041#comment-15358041
]
Kengo Seki commented on AIRFLOW-246:
------------------------------------
bq. it delivers different results than before, as it double counts the TI
belonging to the last DAG run if it is still running?
Indeed. I totally missed that. Thanks for pointing out!
bq. please open a JIRA about updating UPDATES.md.
As another option, adding {{AND (state IS NULL OR state <> 'running')}} to the
WHERE clause in the second subquery will fix this.
> dag_stats endpoint has a terrible query
> ---------------------------------------
>
> Key: AIRFLOW-246
> URL: https://issues.apache.org/jira/browse/AIRFLOW-246
> Project: Apache Airflow
> Issue Type: Bug
> Components: webserver
> Affects Versions: Airflow 1.7.1
> Environment: MySQL Backend through sqlalchemy
> Reporter: Neil Hanlon
> Assignee: Kengo Seki
> Fix For: Airflow 1.8
>
>
> Hitting this endpoint creates a series of queries on the database which take
> over 20 seconds to run, causing the page to not load for that entire time.
> Luckily the main page (which includes this under "Recent Statuses") loads
> this synchronously, but still... waiting almost half a minute (at times more)
> to see the statuses for dags is really not fun.
> We have less than a million rows in the task_instance table--so it's not even
> a problem with that.
> Here's a query profile for the query:
> https://gist.github.com/NeilHanlon/613f12724e802bc51c23fca7d46d28bf
> We've done some optimizations on the database, but to no avail.
> The query:
> {code:sql}
> SELECT task_instance.dag_id AS task_instance_dag_id, task_instance.state AS
> task_instance_state, count(task_instance.task_id) AS count_1 FROM
> task_instance LEFT OUTER JOIN (SELECT dag_run.dag_id AS dag_id,
> dag_run.execution_date AS execution_date FROM dag_run WHERE dag_run.state =
> 'running') AS running_dag_run ON running_dag_run.dag_id =
> task_instance.dag_id AND running_dag_run.execution_date =
> task_instance.execution_date LEFT OUTER JOIN (SELECT dag_run.dag_id AS
> dag_id, max(dag_run.execution_date) AS execution_date FROM dag_run GROUP BY
> dag_run.dag_id) AS last_dag_run ON last_dag_run.dag_id = task_instance.dag_id
> AND last_dag_run.execution_date = task_instance.execution_date WHERE
> task_instance.task_id IN ... AND (running_dag_run.dag_id IS NOT NULL OR
> last_dag_run.dag_id IS NOT NULL) GROUP BY task_instance.dag_id,
> task_instance.state;
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)