SamWheating opened a new issue #21397:
URL: https://github.com/apache/airflow/issues/21397


   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL 
instance was running near 100% CPU utilization while it was closer to 20% on 
Airflow 2.1.2.
   
   After a long investigation, we found that the queries introduced in 
https://github.com/apache/airflow/pull/17121 are extremely hard on the 
database, especially when there are multiple schedulers, a high value of 
`parsing_processes` and a large number of DAGs. 
   
   Our setup is as follows:
    - 4x Schedulers
    - ~20k rows in the `dag` table
    - ~10k DAG files
    - `parsing_processes=32`
    
   So each time a DAG is parsed, a query like this will be run:
   ```sql
   UPDATE dag
   SET is_active=0
   WHERE dag.fileloc = '/path/to_my/dag.py'
   AND dag.is_active = 1
   AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2');
   ```
   
   And because `dag` isn't indexed on `fileloc`, the query ends up doing a 
full-table-scan (or nearly a full-table-scan), and this is repeated for every 
single file which is processed. 
   
   When I added these queries, I tested the change in a local `breeze` 
environment with a relatively small number of DAGs and thus did not notice the 
performance implications.
   
   At our scale / configuration, we have approximately 128 of these 
poorly-performant queries running in parallel, each scanning approximately 
20,000 rows. Understandably this was really hard on the database which ended up 
drastically impacting the performance of other queries. 
   
   We were able to reduce the impact by lowering the `parsing_processes`, 
cleaning up old entries in the `dag` table and increasing the 
`min_file_processing_interval`, but none of these mitigations really address 
the root of the problem. 
   
   We are currently working on a fix which moves this cleanup to the DAG 
Processor manager and eliminates un-indexed queries and should be able to 
submit a preliminary pull request for review in the next few days. 
   
   ### What you expected to happen
   
   Removing DAGs which no longer exist in files should not put so much strain 
on the database. 
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to