SamWheating opened a new issue #21397: URL: https://github.com/apache/airflow/issues/21397
### Apache Airflow version 2.2.2 ### What happened When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL instance was running near 100% CPU utilization while it was closer to 20% on Airflow 2.1.2. After a long investigation, we found that the queries introduced in https://github.com/apache/airflow/pull/17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of `parsing_processes` and a large number of DAGs. Our setup is as follows: - 4x Schedulers - ~20k rows in the `dag` table - ~10k DAG files - `parsing_processes=32` So each time a DAG is parsed, a query like this will be run: ```sql UPDATE dag SET is_active=0 WHERE dag.fileloc = '/path/to_my/dag.py' AND dag.is_active = 1 AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2'); ``` And because `dag` isn't indexed on `fileloc`, the query ends up doing a full-table-scan (or nearly a full-table-scan), and this is repeated for every single file which is processed. When I added these queries, I tested the change in a local `breeze` environment with a relatively small number of DAGs and thus did not notice the performance implications. At our scale / configuration, we have approximately 128 of these poorly-performant queries running in parallel, each scanning approximately 20,000 rows. Understandably this was really hard on the database which ended up drastically impacting the performance of other queries. We were able to reduce the impact by lowering the `parsing_processes`, cleaning up old entries in the `dag` table and increasing the `min_file_processing_interval`, but none of these mitigations really address the root of the problem. We are currently working on a fix which moves this cleanup to the DAG Processor manager and eliminates un-indexed queries and should be able to submit a preliminary pull request for review in the next few days. ### What you expected to happen Removing DAGs which no longer exist in files should not put so much strain on the database. ### How to reproduce _No response_ ### Operating System Debian GNU/Linux 10 (buster) ### Versions of Apache Airflow Providers _No response_ ### Deployment Other 3rd-party Helm chart ### Deployment details _No response_ ### Anything else _No response_ ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
