wjddn279 commented on code in PR #55589:
URL: https://github.com/apache/airflow/pull/55589#discussion_r2349308805
##########
airflow-core/src/airflow/models/serialized_dag.py:
##########
@@ -460,7 +460,9 @@ def write_dag(
     @classmethod
     def latest_item_select_object(cls, dag_id):
-        return select(cls).where(cls.dag_id == dag_id).order_by(cls.created_at.desc()).limit(1)
+        # prevent "Out of sort memory" caused by large values in the cls.data column
+        latest_item_id = select(cls.id).where(cls.dag_id == dag_id).order_by(cls.created_at.desc()).limit(1)
+        return select(cls).where(cls.id == latest_item_id)
Review Comment:
@ashb Thank you. I expected that the two queries would not differ much in performance on PostgreSQL, and I verified this by actually running both the original and modified queries and comparing their query plans.
The experiment was conducted by running PostgreSQL and the Airflow components via Breeze, in a local Docker environment with PostgreSQL version 13.
## AS-IS
```
airflow=# EXPLAIN SELECT serialized_dag.data,
       serialized_dag.data_compressed,
       serialized_dag.id,
       serialized_dag.dag_id,
       serialized_dag.created_at,
       serialized_dag.last_updated,
       serialized_dag.dag_hash,
       serialized_dag.dag_version_id
FROM serialized_dag
WHERE serialized_dag.dag_id = 'example_branch_datetime_operator_3'
ORDER BY serialized_dag.created_at DESC LIMIT 1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit  (cost=16.04..16.04 rows=1 width=1394)
   ->  Sort  (cost=16.04..16.04 rows=1 width=1394)
         Sort Key: created_at DESC
         ->  Seq Scan on serialized_dag  (cost=0.00..16.02 rows=1 width=1394)
               Filter: ((dag_id)::text = 'example_branch_datetime_operator_3'::text)
```
## TO-BE
```
airflow=# EXPLAIN SELECT serialized_dag.data,
       serialized_dag.data_compressed,
       serialized_dag.id,
       serialized_dag.dag_id,
       serialized_dag.created_at,
       serialized_dag.last_updated,
       serialized_dag.dag_hash,
       serialized_dag.dag_version_id
FROM serialized_dag
WHERE serialized_dag.id = (
    SELECT serialized_dag.id
    FROM serialized_dag
    WHERE serialized_dag.dag_id = 'example_branch_datetime_operator_3'
    ORDER BY serialized_dag.created_at DESC
    LIMIT 1
);
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using serialized_dag_pkey on serialized_dag  (cost=16.18..24.20 rows=1 width=1394)
   Index Cond: (id = $0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=16.04..16.04 rows=1 width=24)
           ->  Sort  (cost=16.04..16.04 rows=1 width=24)
                 Sort Key: serialized_dag_1.created_at DESC
                 ->  Seq Scan on serialized_dag serialized_dag_1  (cost=0.00..16.02 rows=1 width=24)
                       Filter: ((dag_id)::text = 'example_branch_datetime_operator_3'::text)
```
## Summary
Comparing the query plans, the AS-IS plan and the InitPlan inside the TO-BE plan are identical.
The only difference is the additional Index Scan in the TO-BE query, which fetches a single row by its primary key. Note also that the sort width drops from 1394 bytes to 24 bytes, since the wide `data` column no longer participates in the sort. Because an Index Scan that retrieves a single record by primary key has negligible overhead, it is reasonable to conclude that this change does not introduce any meaningful performance regression.
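For illustration, the SQL-level shape of the rewrite can be sketched against a throwaway in-memory SQLite table (a hypothetical miniature of `serialized_dag`, not the real schema); both query shapes pick the same latest row, while the TO-BE shape keeps the wide `data` column out of the sort:

```python
import sqlite3

# Hypothetical miniature of the serialized_dag table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serialized_dag ("
    " id INTEGER PRIMARY KEY,"
    " dag_id TEXT,"
    " created_at TEXT,"
    " data TEXT)"  # the wide column that would otherwise enter the sort
)
conn.executemany(
    "INSERT INTO serialized_dag (dag_id, created_at, data) VALUES (?, ?, ?)",
    [
        ("example_dag", "2024-01-01", "old payload"),
        ("example_dag", "2024-01-02", "new payload"),
        ("other_dag", "2024-01-03", "irrelevant"),
    ],
)

# AS-IS shape: sort full rows, wide `data` values included in the sort input.
as_is = conn.execute(
    "SELECT id, data FROM serialized_dag"
    " WHERE dag_id = ? ORDER BY created_at DESC LIMIT 1",
    ("example_dag",),
).fetchone()

# TO-BE shape: sort only narrow columns, then fetch one row by primary key.
to_be = conn.execute(
    "SELECT id, data FROM serialized_dag WHERE id = ("
    " SELECT id FROM serialized_dag"
    " WHERE dag_id = ? ORDER BY created_at DESC LIMIT 1)",
    ("example_dag",),
).fetchone()

print(as_is == to_be)  # -> True: both shapes return the same latest row
```

This is only a sketch of the query rewrite; the actual change in the PR builds the equivalent statement with SQLAlchemy, as shown in the diff above.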
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]