jcborras opened a new issue, #25502:
URL: https://github.com/apache/airflow/issues/25502
### Apache Airflow version
2.1.2
### What happened
When loading AirFlow v2 home page subsequent requests include `GET
/last_dagruns`, `GET /dag_stats` and `GET /tasks_stats` which in turn query the
database (PostgreSQL in this case). These DB queries (most likely generated by
an ORM) have the form:
```{sql}
SELECT ...
FROM ..
WHERE dag_id IN ('active_dag1', 'active_dag2, ...., 'active_dag_n`)
```
Which have one issue: unless the postgresql query optimizer figures out that
those strings are actually PKs in some table the `IN` operation is carried out
by doing a linear search which when the list of DAGs and the number of rows in
the tables is high (ie. 300 actives dags and 10M rows) results in response
times in the order of minutes not seconds.
This situation has unintended consequences:
- if the number of HTTP workers is low (ie. 3) it will prevent other users
from loading any page.
- if your monitoring is based on the classic frequent `GET /health` request
you are like to get timeout hence triggering alarms.
### What you think should happen instead
I've made the queries run faster by creating a tmp table storing the values
in the `IN` operator and then `JOIN`ing. That is:
```{sql}
SELECT ..
FROM ..
JOIN tmp_dag_id USING (dag_id)
```
Then the implicit indexes take care of it all. But that of course is a lot
to ask to an ORM.
### How to reproduce
You'll need a setup with a few hundred DAGs. If you run them often the db
will pretty soon contain tables with millions of entries thus making the issue
arise.
### Operating System
Linux
### Versions of Apache Airflow Providers
_No response_
### Deployment
Other Docker-based deployment
### Deployment details
Docker images which operations managed under AWS ECS and Fargate.
### Anything else
N/A
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]