potiuk commented on a change in pull request #9850:
URL: https://github.com/apache/airflow/pull/9850#discussion_r455781724



##########
File path: airflow/models/serialized_dag.py
##########
@@ -94,13 +97,26 @@ def write_dag(cls, dag: DAG, min_update_interval: Optional[int] = None, session=
                      (timezone.utcnow() - timedelta(seconds=min_update_interval)) < cls.last_updated))
             ).scalar():
                 return
-        log.debug("Writing DAG: %s to the DB", dag.dag_id)
-        session.merge(cls(dag))
+
+        log.debug("Checking if DAG (%s) is updated or same", dag.dag_id)
+        serialized_dag_from_db: SerializedDagModel = (
+            session
+            .query(cls)
+            .filter(cls.dag_id == dag.dag_id)
+            .one_or_none()
+        )

Review comment:
       In the Airbnb talk they mentioned really big DAGs and the need to compress the serialized DAG for its DB representation (they implemented this in their fork). I think adding one extra read of the full serialized DAG here might add quite a lot of DB throughput. It's an edge case, but possibly an important one, and some companies are already hitting it.

   I know it would be a somewhat bigger change (it involves a DB migration), but perhaps we should store a hash of the DAG, read only the hash from the DB, and compare hashes rather than the DAG itself. That seems perfectly doable and not that complex.
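
   The idea could look roughly like this (a minimal sketch; `dag_fingerprint` and the `dag_hash` column are illustrative names, not code from this PR):

```python
import hashlib
import json


def dag_fingerprint(serialized_dag: dict) -> str:
    """Return a stable hex digest for a DAG's serialized (JSON-able) form.

    Keys are sorted so that semantically identical DAGs always hash the same
    regardless of dict ordering.
    """
    payload = json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# The write path would then fetch only the stored hash (one small column)
# instead of the whole serialized DAG, e.g. (pseudocode against the model):
#
#   stored_hash = session.query(cls.dag_hash).filter(
#       cls.dag_id == dag.dag_id).scalar()
#   if stored_hash == dag_fingerprint(new_serialized_dag):
#       return  # unchanged -- skip the merge entirely
```

   Comparing a fixed-size digest avoids shipping the (possibly compressed, possibly huge) serialized DAG over the wire on every write check.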




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

