wjddn279 commented on code in PR #55589:
URL: https://github.com/apache/airflow/pull/55589#discussion_r2349308805


##########
airflow-core/src/airflow/models/serialized_dag.py:
##########
@@ -460,7 +460,9 @@ def write_dag(
 
     @classmethod
     def latest_item_select_object(cls, dag_id):
-        return select(cls).where(cls.dag_id == dag_id).order_by(cls.created_at.desc()).limit(1)
+        # prevent "Out of sort memory" caused by large values in cls.data column
+        latest_item_id = select(cls.id).where(cls.dag_id == dag_id).order_by(cls.created_at.desc()).limit(1)
+        return select(cls).where(cls.id == latest_item_id)

Review Comment:
   @ashb Thank you. I expected that PostgreSQL would not differ much in terms of performance, but I verified this by actually running both the original and modified queries and comparing their query plans.
   
   The experiment was conducted by running PostgreSQL and the Airflow components via Breeze, in a local Docker environment with PostgreSQL 13.
   
   ## AS-IS
   ```
   airflow=# explain SELECT serialized_dag.data,
          serialized_dag.data_compressed,
          serialized_dag.id,
          serialized_dag.dag_id,
          serialized_dag.created_at,
          serialized_dag.last_updated,
          serialized_dag.dag_hash,
          serialized_dag.dag_version_id
   FROM serialized_dag
   WHERE serialized_dag.dag_id = 'example_branch_datetime_operator_3'
   ORDER BY serialized_dag.created_at DESC LIMIT 1;
                                        QUERY PLAN
   -------------------------------------------------------------------------------------
    Limit  (cost=16.04..16.04 rows=1 width=1394)
      ->  Sort  (cost=16.04..16.04 rows=1 width=1394)
            Sort Key: created_at DESC
            ->  Seq Scan on serialized_dag  (cost=0.00..16.02 rows=1 width=1394)
                  Filter: ((dag_id)::text = 'example_branch_datetime_operator_3'::text)
   ```
   
   ## TO-BE
   ```
   airflow=# EXPLAIN SELECT serialized_dag.data,
          serialized_dag.data_compressed,
          serialized_dag.id,
          serialized_dag.dag_id,
          serialized_dag.created_at,
          serialized_dag.last_updated,
          serialized_dag.dag_hash,
          serialized_dag.dag_version_id
   FROM serialized_dag
   WHERE serialized_dag.id = (
       SELECT serialized_dag.id
       FROM serialized_dag
       WHERE serialized_dag.dag_id = 'example_branch_datetime_operator_3'
       ORDER BY serialized_dag.created_at DESC
       LIMIT 1
   );
                                                QUERY PLAN
   -----------------------------------------------------------------------------------------------------
    Index Scan using serialized_dag_pkey on serialized_dag  (cost=16.18..24.20 rows=1 width=1394)
      Index Cond: (id = $0)
      InitPlan 1 (returns $0)
        ->  Limit  (cost=16.04..16.04 rows=1 width=24)
              ->  Sort  (cost=16.04..16.04 rows=1 width=24)
                    Sort Key: serialized_dag_1.created_at DESC
                    ->  Seq Scan on serialized_dag serialized_dag_1  (cost=0.00..16.02 rows=1 width=24)
                          Filter: ((dag_id)::text = 'example_branch_datetime_operator_3'::text)
   ```
   
   ## Summary
   Comparing the query plans, the AS-IS plan and the InitPlan inside the TO-BE plan are identical. The only addition in the TO-BE plan is the outer Index Scan, which fetches a single row by its primary key; since a primary-key lookup of one record has negligible overhead, this change should not introduce any meaningful performance impact on PostgreSQL.
   Note also that the Sort in the TO-BE InitPlan operates on rows of width=24 rather than width=1394, because the large `data` column is no longer carried through the sort — which is exactly what prevents the "Out of sort memory" error this PR targets.
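
The TO-BE query shape can be reproduced with a minimal, self-contained sketch. This uses SQLite and a made-up miniature version of the `serialized_dag` table (the real table and its SQLAlchemy model live in Airflow); the point is only to show the pattern of sorting a narrow `(id, created_at)` projection in a subquery and then fetching the single wide row by primary key:

```python
import sqlite3

# In-memory stand-in for the serialized_dag table (hypothetical mini-schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE serialized_dag (
        id INTEGER PRIMARY KEY,
        dag_id TEXT,
        created_at TEXT,
        data TEXT  -- stands in for the large payload column that makes sorting expensive
    )
    """
)
conn.executemany(
    "INSERT INTO serialized_dag VALUES (?, ?, ?, ?)",
    [
        (1, "example_dag", "2024-01-01", "v1 payload"),
        (2, "example_dag", "2024-01-02", "v2 payload"),
        (3, "other_dag", "2024-01-03", "other payload"),
    ],
)

# TO-BE shape: the inner query sorts only (id, created_at) for the matching
# dag_id; the outer query then retrieves the full row by primary key.
latest = conn.execute(
    """
    SELECT id, dag_id, created_at, data
    FROM serialized_dag
    WHERE id = (
        SELECT id FROM serialized_dag
        WHERE dag_id = ?
        ORDER BY created_at DESC
        LIMIT 1
    )
    """,
    ("example_dag",),
).fetchone()
print(latest)  # (2, 'example_dag', '2024-01-02', 'v2 payload')
```

The same two-step structure is what the PR expresses in SQLAlchemy: the sort never touches the wide `data` column, so the database's sort buffer only has to hold narrow key rows.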



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]