VladimirYushkevich commented on issue #35267:
URL: https://github.com/apache/airflow/issues/35267#issuecomment-2112691876

   I managed to narrow this problem down a bit. When our DAG starts with 1000+ 
Dynamically Mapped Tasks, we experience significant performance issues: the UI 
is slow, and there are warnings that the scheduler or triggerer is not 
available. The only impacted metrics I found are from `pgbouncer`, and our 
PostgreSQL instance (on GCP) started to report DB locks. The spikes below 
correlate with the time when this DAG with Dynamically Mapped Tasks is running:
   ![Screenshot 2024-05-14 at 20 32 
05](https://github.com/apache/airflow/assets/6008151/e1322dc0-746c-46be-a9b5-106ce6f6871f)
   I listed the long-running queries from `pg_stat_activity`:
   ![Screenshot 2024-05-14 at 20 57 
44](https://github.com/apache/airflow/assets/6008151/234f9de3-98fb-498f-9de5-ce9a99ab32e5)
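   For anyone who wants to reproduce a listing like this, a query along these 
lines might work. It is only a sketch: the exact query behind the screenshot is 
not shown here, and the column selection and the 120-character truncation are 
my assumptions.
   ```sql
   -- Sketch (assumed, not the query from the screenshot):
   -- non-idle sessions, longest-running current query first.
   SELECT pid,
          state,
          wait_event_type,                      -- 'Lock' means waiting on a heavyweight lock
          wait_event,
          now() - query_start AS running_for,   -- time since the current query started
          left(query, 120)    AS query_snippet  -- truncated for readability
   FROM pg_stat_activity
   WHERE state <> 'idle'
     AND pid <> pg_backend_pid()                -- exclude this monitoring session
   ORDER BY running_for DESC NULLS LAST;
   ```
   Since connections go through `pgbouncer`, this probably needs to be run 
against the PostgreSQL instance itself to see the real backend sessions.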
   One of the frequently running queries looks like:
   ```
   SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, 
dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, 
dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS 
dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS 
dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, 
dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS 
dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS 
dag_run_data_interval_start, dag_run.data_interval_end AS 
dag_run_data_interval_end, dag_run.last_scheduling_decision AS 
dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, 
dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS 
dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
   FROM dag_run
   WHERE dag_run.dag_id = 'retryable_dag' AND dag_run.run_id = 
'scheduled__2024-05-14T08:00:00+00:00' FOR UPDATE;
   ```
   I tried running this query with and without `FOR UPDATE` (it doesn't really 
matter). While the DAG is running it takes 3-4 min, most of which is spent 
waiting for the lock. When I pause the DAG, exactly the same query takes ~1s.
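   To see which sessions are actually holding the locks, something like the 
following could be run while the DAG is active. This is a sketch, not something 
taken from the measurements above: it assumes PostgreSQL 9.6 or newer (for 
`pg_blocking_pids`), and the 80-character truncation is arbitrary.
   ```sql
   -- Sketch: pair each lock-waiting session with the session(s) blocking it.
   SELECT blocked.pid              AS blocked_pid,
          left(blocked.query, 80)  AS blocked_query,
          blocking.pid             AS blocking_pid,
          blocking.state           AS blocking_state,  -- e.g. 'idle in transaction'
          left(blocking.query, 80) AS blocking_query
   FROM pg_stat_activity AS blocked
   -- pg_blocking_pids() returns an integer[]; unnest gives one row per blocker
   JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
   JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
   WHERE blocked.wait_event_type = 'Lock';
   ```
   If the blocking queries turn out to come from the mapped task instances, 
that would support the suspicion; if not, the contention is coming from 
somewhere else.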
   We have hundreds of other DAGs running at the same time and haven't seen 
such an issue. 
   My suspicion is that the running Dynamically Mapped Tasks are the source of 
the locks in the DB.

