potiuk commented on a change in pull request #9850:
URL: https://github.com/apache/airflow/pull/9850#discussion_r455781724
##########
File path: airflow/models/serialized_dag.py
##########
@@ -94,13 +97,26 @@ def write_dag(cls, dag: DAG, min_update_interval:
Optional[int] = None, session=
(timezone.utcnow() -
timedelta(seconds=min_update_interval)) < cls.last_updated))
).scalar():
return
- log.debug("Writing DAG: %s to the DB", dag.dag_id)
- session.merge(cls(dag))
+
+ log.debug("Checking if DAG (%s) is updated or same", dag.dag_id)
+ serialized_dag_from_db: SerializedDagModel = (
+ session
+ .query(cls)
+ .filter(cls.dag_id == dag.dag_id)
+ .one_or_none()
+ )
Review comment:
In the AirBnB talk they mentioned really big DAGs and the need to use
compression in the DAG serialization DB (they implemented it in their fork). I
think adding one extra read here might add quite heavy DB throughput. It's an
edge case, but possibly an important one, and some companies are already
experiencing it.
I know it would be a somewhat bigger change (involving a DB migration), but
possibly we should calculate a hash of the DAG, store it, and compare the hash
read from the DB rather than the DAG itself. That seems perfectly doable and
not that complex.
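A minimal sketch of the idea, assuming a hypothetical `dag_hash` column stored alongside the serialized DAG (not the actual Airflow implementation; names are illustrative):

```python
# Hedged sketch: compare a small fingerprint of the serialized DAG against
# a stored hash instead of reading back the full serialized blob.
# `dag_fingerprint` and `needs_update` are hypothetical helpers.
import hashlib
import json


def dag_fingerprint(serialized_dag: dict) -> str:
    """Stable hash of a serialized DAG, usable as a cheap change detector."""
    # sort_keys makes the JSON encoding deterministic regardless of dict order
    payload = json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def needs_update(stored_hash: str, serialized_dag: dict) -> bool:
    """True if the DAG changed since the stored hash was written."""
    return stored_hash != dag_fingerprint(serialized_dag)
```

With this, `write_dag` would only need to select the (short) hash column for the comparison, and would write the new serialized DAG plus its hash only when `needs_update` returns True.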
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]