pjavier29 opened a new issue, #61453:
URL: https://github.com/apache/airflow/issues/61453
### Apache Airflow version
3.1.6
### If "Other Airflow 3 version" selected, which one?
3.1.5
### What happened?
In SchedulerJobRunner._activate_referenced_assets(), the query that builds
active_assets uses a tuple-based IN filter on (name, uri) constructed from a
Python collection (assets). On PostgreSQL 17 with ~16,000 rows in the assets
table, this pattern can generate a very large IN list (many bind parameters),
which significantly increases parse/plan overhead and can make the scheduler
appear to “hang” during asset orphanage/activation processing.
Repo: apache/airflow
File: airflow-core/src/airflow/jobs/scheduler_job_runner.py
Method: SchedulerJobRunner._activate_referenced_assets
DB engine: PostgreSQL 17
Scale: ~16,000 records in assets table (and the in-memory assets collection
can be large)
Problematic pattern: tuple_(AssetActive.name, AssetActive.uri).in_((a.name,
a.uri) for a in assets)
### What you think should happen instead?
Fetching the subset of AssetActive corresponding to referenced assets should
be efficient even when the referenced set is large, without constructing
massive parameter lists from Python.
Use JOIN / EXISTS against a subquery/CTE representing the referenced assets
set, so the set is computed inside the database and avoids Python-materialized
parameter lists.
Conceptual approach: AssetActive JOIN (subselect of referenced assets) ON
(name, uri).
### How to reproduce
This issue is most visible when the referenced-assets set is large, since
the current implementation materializes the set in Python and feeds it back to
Postgres as a giant IN list.
Specific blo
Query specifies blocker:
SELECT airflow.asset_active.name, airflow.asset_active.uri
FROM airflow.asset_active
WHERE (airflow.asset_active.name, airflow.asset_active.uri) IN
(('s3://despegar-data-airflow/prod/dataset/8495.txt',
's3://despegar-data-airflow/prod/dataset/8495.txt'),
('s3://despegar-data-airflow/prod/dataset/10421.txt',
's3://despegar-data-airflow/prod/dataset/10421.txt'), .......
This query block the scheduler and need reboot.
### Operating System
Ubuntu 22.04.4 LTS
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]