pjavier29 opened a new issue, #61453:
URL: https://github.com/apache/airflow/issues/61453

   ### Apache Airflow version
   
   3.1.6
   
   ### If "Other Airflow 3 version" selected, which one?
   
   3.1.5
   
   ### What happened?
   
   In SchedulerJobRunner._activate_referenced_assets(), the query that builds 
active_assets uses a tuple-based IN filter on (name, uri) constructed from a 
Python collection (assets). On PostgreSQL 17 with ~16,000 rows in the assets 
table, this pattern can generate a very large IN list (many bind parameters), 
which significantly increases parse/plan overhead and can make the scheduler 
appear to “hang” during asset orphanage/activation processing.
   
   Repo: apache/airflow
   File: airflow-core/src/airflow/jobs/scheduler_job_runner.py
   Method: SchedulerJobRunner._activate_referenced_assets
   DB engine: PostgreSQL 17
   Scale: ~16,000 records in assets table (and the in-memory assets collection 
can be large)
   Problematic pattern: tuple_(AssetActive.name, AssetActive.uri).in_((a.name, 
a.uri) for a in assets)
   
   ### What you think should happen instead?
   
   Fetching the subset of AssetActive corresponding to referenced assets should 
be efficient even when the referenced set is large, without constructing 
massive parameter lists from Python.
   
   Use JOIN / EXISTS against a subquery/CTE representing the referenced assets 
set, so the set is computed inside the database and avoids Python-materialized 
parameter lists.
   Conceptual approach: AssetActive JOIN (subselect of referenced assets) ON 
(name, uri).
   
   ### How to reproduce
   
   This issue is most visible when the referenced-assets set is large, since 
the current implementation materializes the set in Python and feeds it back to 
Postgres as a giant IN list.
   Specific blo
   
   Query specifies blocker:
   SELECT airflow.asset_active.name, airflow.asset_active.uri
   FROM airflow.asset_active
   WHERE (airflow.asset_active.name, airflow.asset_active.uri) IN 
(('s3://despegar-data-airflow/prod/dataset/8495.txt', 
's3://despegar-data-airflow/prod/dataset/8495.txt'), 
('s3://despegar-data-airflow/prod/dataset/10421.txt', 
's3://despegar-data-airflow/prod/dataset/10421.txt'), .......
   
   This query block the scheduler and need reboot.
   
   ### Operating System
   
   Ubuntu 22.04.4 LTS
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to