jvstein opened a new issue, #61894:
URL: https://github.com/apache/airflow/issues/61894

   ### Apache Airflow version
   
   3.1.7
   
   ### If "Other Airflow 3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   I'm testing the upgrade of an Airflow 2.11.0 instance to 3.1.7 in Kubernetes 
and was observing repeated restarts of the dag-processor during initial start 
up due to failed liveness checks. After removing the liveness probe, the 
processor was able to show the actual errors. The dag-processor was 
successfully starting parses of the individual DAG files (up to the process 
limit), but then each DAG parse would stall. The dag-processor would sit there 
consuming 0% CPU resources and then eventually each DAG parse step would time 
out with a message like this and then a new batch would start and the process 
would repeat:
   
   > Processor for DagFileInfo(rel_path=PosixPath('my/dag/path.py'), 
bundle_name='dags-folder', bundle_path=PosixPath('/opt/airflow/dags'), 
bundle_version=None) with PID 19 started 807 ago killing it.
   
   During this, I observed a very slow query in the upgraded Airflow database.
   
   > SELECT EXISTS (SELECT * 
   FROM task_instance 
   WHERE task_instance.dag_version_id = 
'019c453a-63af-77d4-8a7f-edd1832cae1e'::UUID) AS anon_1
   
   Our `task_instance` table has over 27M rows and the `dag_version_id` column 
is not indexed, so this was taking tens of minutes to do a full table scan.
   
   After patching our instance with this change, the dag-processor was able to 
proceed.
     
https://github.com/apache/airflow/compare/3.1.7...jvstein:airflow:fix_slow_task_instance_query
   
   ### What you think should happen instead?
   
   The query in question does a full table scan on a very large table. There 
appears to be an available `dag_id` value in the function, which should be used 
for filtering in addition to the `dag_version_id` value.
   
   ### How to reproduce
   
   Start with a very large number of records in the `task_instance` table. Run 
the dag-processor. The initial parsing of DAGs should be very slow to proceed.
   
   ### Operating System
   
   Debian bookworm
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==9.21.0
   apache-airflow-providers-apache-iceberg==1.4.1
   apache-airflow-providers-celery==3.15.2
   apache-airflow-providers-cncf-kubernetes==10.12.3
   apache-airflow-providers-common-compat==1.13.0
   apache-airflow-providers-common-io==1.7.1
   apache-airflow-providers-common-sql==1.30.4
   apache-airflow-providers-fab==3.2.0
   apache-airflow-providers-google==19.5.0
   apache-airflow-providers-hashicorp==4.4.3
   apache-airflow-providers-http==5.6.4
   apache-airflow-providers-mysql==6.4.2
   apache-airflow-providers-postgres==6.5.3
   apache-airflow-providers-sendgrid==4.2.1
   apache-airflow-providers-slack==9.6.2
   apache-airflow-providers-smtp==2.4.2
   apache-airflow-providers-standard==1.11.0
   apache-airflow-providers-trino==6.4.2
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Deployed via helm to k8s. Not heavily customized.
   
   ### Anything else?
   
   In our upgraded database with lots of history, this happened every time.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to